
nestle's People

Contributors

aarchiba, aasensio, bek0s, cdeil, dannygoldstein, gregoryashton, ipashchenko, joshspeagle, kbarbary, ruthangus


nestle's Issues

RuntimeError: Negative h encountered (h=-1.4971192285884172e-08). Please report this as a likely bug.

Minimal reproduction (boiled down from my big project):

import pandas as pd
import sncosmo
from redback.get_data import get_lasair_data

data = get_lasair_data(transient="ZTF23aadchqd", transient_type='supernova')

sncosmo_data = pd.DataFrame()
sncosmo_data["time"] = data["time"]
sncosmo_data["band"] = data["band"]
sncosmo_data["flux"] = data["flux_density(mjy)"]
sncosmo_data["flux_err"] = data["flux_density_error"]
sncosmo_data["zp"] = [25] * len(data)
sncosmo_data["zpsys"] = data["system"].str.lower()

# Convert the DataFrame to a plain dict of column -> list of values.
data = {col: series.tolist() for col, series in sncosmo_data.items()}

for i in range(100):
    model = sncosmo.Model(source='snana-2004gv')
    model.set(z=0.053)

    result, fitted_model = sncosmo.nest_lc(
        data, model,
        ['t0', 'amplitude'],
        bounds={'t0': (59937.38354322212, 59937.38354322212), 'amplitude': (1.7139958333452147e-17, 1.7139958333452147e-17)})
print("DONE")

Setup before running:

pip install sncosmo bilby selenium lxml nestle iminuit openpyxl
mkdir redback_build_from_source_because_lasair_update_not_in_pip
cd redback_build_from_source_because_lasair_update_not_in_pip
git clone https://github.com/nikhil-sarin/redback
cd redback/
pip install -r requirements.txt
pip install .
cd ../..
rm -rf ./redback_build_from_source_because_lasair_update_not_in_pip

Versions:

  • Python: 3.11.8 (venv)
  • OS: linux 6.8.4.arch1-1
  • nestle: 0.2.0

Restarting the sampler

Hi,

First off, thanks for sharing a great and really useful piece of code!

I was wondering if you had experimented with being able to checkpoint and restart the sampler?

I'm currently using Nestle to explore the parameter space and Bayesian evidence for different parametrisations of a galaxy formation model. This model is run using MPI on HPC systems where jobs must be submitted to a queue with a wallclock limit. Since the length of time the sampler needs to run is not deterministic, I have had jobs time out after a week of running. It would be great if there were a callback for dumping a checkpoint file that could be used to restart the sampler...
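Under the current API, the closest hook is the per-iteration callback. A minimal sketch of a checkpoint dump (assuming, as the output of nestle.print_progress suggests, that the callback's info dict carries an 'it' key; restarting from such a dump would still require new support inside the sampler):

import pickle

def checkpoint(info, path="nestle_checkpoint.pkl"):
    # Dump the callback's info dict every 1000 iterations; the file name
    # and cadence are illustrative.
    if info["it"] % 1000 == 0:
        with open(path, "wb") as f:
            pickle.dump(info, f)

# res = nestle.sample(loglike, prior_transform, ndim, callback=checkpoint)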

Try using DBSCAN for initial clustering

In the multi-ellipsoidal clustering, clusters are created by recursively splitting points into two clusters. In some (contrived) examples, this results in strange partitioning. Example:

[figure: example of strange partitioning, with the central mode split between two ellipsoids]

In this case, it would clearly be better to bound the center points in a single ellipsoid. What probably happened here is that the points were initially split into two clusters vertically, meaning that the center mode was split into two right away and so could never be reassembled.

It might be helpful to do initial clustering using an algorithm with a non-fixed number of clusters, such as DBSCAN. However, DBSCAN would not split curving degeneracies into multiple clusters, so we'd still need to use the current algorithm for that.

I'm not sure how big a problem this is in "real-world" problems.
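For reference, a minimal sketch of seeding the decomposition with scikit-learn's DBSCAN (the eps and min_samples values are illustrative and would in practice need to track the current point volume):

import numpy as np
from sklearn.cluster import DBSCAN

def initial_clusters(points, eps=0.3, min_samples=5):
    # DBSCAN chooses the number of clusters itself; it marks outliers
    # with label -1, which would need to be merged into a nearby cluster
    # before fitting bounding ellipsoids.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    return [points[labels == k] for k in np.unique(labels)]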

In 'multi' method, allow modes to disappear gracefully

In the multi-ellipsoidal algorithm, a cluster must have a minimum of ndim+1 member points. This means that two widely separated modes will not be split into separate clusters if one mode has only a few points.

For example, in 2-d, here is how two modes are separated when the lower peak has 3 points:

[figure: bounding ellipsoids when the lower peak has 3 points]

On a later iteration, one of the 3 points in the lower peak is discarded due to having the lowest likelihood. At that point, the separation looks like:

[figure: bounding ellipsoids after one of the 3 points is discarded]

Possible Solution 1: "Freeze" the bounding ellipsoid for clusters that have ndim + 1 points. That ellipsoid would be used until all its points disappear. A little distasteful because it makes the ellipsoid decomposition "stateful": you can't just look at a set of points and see how the bounding ellipsoids will look; the answer depends on previous iterations.

Possible Solution 2: Relax the requirement of clusters having ndim + 1 points. Would expand ellipsoid dimensions to fulfill a target volume. May lead to oversplitting into too many ellipsoids.

DBSCAN for mode identification might help with this. (If DBSCAN identifies a mode, always split, even if there are <= ndim points.)

A good test case for seeing how big a problem this is would be two N-d Gaussians of different heights, as sketched below.
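A minimal sketch of that test likelihood (all constants illustrative):

import numpy as np

ndim = 2
centers = np.array([[-3.0] * ndim, [3.0] * ndim])
log_amps = np.array([0.0, -4.0])  # second peak is ~e^-4 times lower
sigma = 0.2

def loglike(x):
    # Log of a mixture of two isotropic Gaussians of different heights.
    d2 = np.sum((np.asarray(x) - centers) ** 2, axis=1) / sigma ** 2
    return np.logaddexp.reduce(log_amps - 0.5 * d2)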

Strange error when unsupported method name is passed to nestle.sample()

Hello,

If I pass an unsupported method to nestle.sample(), Nestle tries to throw an exception like this:
raise ValueError("Unknown method: {:r}".format(method))

The above fails in Python3.7 with the following message:

ValueError: Unknown format code 'r' for object of type 'str'

What is :r supposed to do anyway?
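For what it's worth, the format mini-language has no r presentation type for strings; repr formatting is spelled with the conversion flag !r, so the intended line was presumably:

raise ValueError("Unknown method: {!r}".format(method))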

More high-dimensionality tests

The test suite currently includes a few "integration tests" - full problems with known analytic evidence as a benchmark - but only in low dimensions. (Though a moderately high dimensional case is included in the examples section of the docs.)

Should add some high-dimensional problems to the tests, ideally ones that exercise the multi-ellipsoidal method. The likelihood function could be overlapping gaussians to create a curving degeneracy, which should result in multiple ellipsoids.
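As a starting point, a sketch of one such integration test: an isotropic Gaussian under a flat prior, whose evidence is analytic (constants illustrative; the curving-degeneracy variant would replace loglike with a sum of offset Gaussians):

import numpy as np
import nestle

ndim, sigma = 10, 0.1

def loglike(x):
    # Normalized isotropic Gaussian centered at the origin.
    return (-0.5 * np.sum((x / sigma) ** 2)
            - ndim * np.log(sigma * np.sqrt(2.0 * np.pi)))

def prior_transform(u):
    # Flat prior on [-5, 5]^ndim.
    return 10.0 * u - 5.0

# The prior volume is 10^ndim, so the analytic evidence is ~10^-ndim.
logz_true = -ndim * np.log(10.0)

res = nestle.sample(loglike, prior_transform, ndim, method='multi', npoints=200)
print(res.logz, "vs analytic", logz_true)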

underflow in vol_prefactor for large dimensions

For large numbers of dimensions, vol_prefactor gives 0. due to underflow:

In [8]: nestle.vol_prefactor(450)
Out[8]: 5.73e-322

In [9]: nestle.vol_prefactor(500)
Out[9]: 0.0

One should probably not be using nestle for dimensionality this large, but it would be good to fix this anyway.
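One possible fix is to work in log space; a sketch, assuming vol_prefactor computes the unit n-ball volume pi^(n/2) / gamma(n/2 + 1):

import math
from scipy.special import gammaln

def log_vol_prefactor(n):
    # log of pi^(n/2) / gamma(n/2 + 1); stays finite where the direct
    # computation underflows to 0.0.
    return 0.5 * n * math.log(math.pi) - gammaln(0.5 * n + 1.0)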

v0.1.0 to-do list

List of to-do items for release, mainly for my own benefit.

  • weightedcov() API: make it as similar as possible to np.cov & np.average.
    Changed name to mean_and_cov() for clarity.
  • Result dictionary (and keys): look at scipy's OptimizeResult.
    Changed logprior to logvol for clarity.
  • Think about callback API some more.
    Single callback function that gets passed a dictionary.
  • ellipsoid partitioning interval
  • "classic" point exploration method
  • List of references
  • Example illustrating prior_transform (a sketch follows this list)
  • Implement more test cases from ellipsoidal nested sampling papers
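On that prior_transform item, a sketch of the kind of example intended: prior_transform maps a point in the unit cube to the prior by applying each parameter's inverse CDF (the parameters and ranges here are illustrative).

import numpy as np

def prior_transform(u):
    m = 10.0 * u[0] - 5.0               # m ~ Uniform(-5, 5)
    sigma = 0.1 * (10.0 / 0.1) ** u[1]  # sigma ~ LogUniform(0.1, 10)
    return np.array([m, sigma])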

Crash in 9 dimensional likelihood

@kbarbary I hope this crash report is somewhat helpful - after a number of iterations for this particular likelihood it always crashes. It is unclear to me why - do you have any pointers?

res_full = nestle.sample(stel_model.logprob, stel_model.transform_priors,
                         9, method='multi', npoints=100,
                         verbose=True, callback=nestle.print_progress, dlogz=1.)

it=  3981 logz=-11465.9874566656779450

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-26aa44a81021> in <module>()
      1 res_full = nestle.sample(stel_model.logprob, stel_model.transform_priors,
      2                          9, method='multi', npoints=100,
----> 3                          verbose=True, callback=nestle.print_progress, dlogz=1.)

/lustre/home/wkerzend/miniconda3/envs/starkit/lib/python2.7/site-packages/nestle.pyc in sample(loglikelihood, prior_transform, ndim, npoints, method, update_interval, npdim, maxiter, maxcall, dlogz, decline_factor, rstate, callback, queue_size, pool, **options)
   1019         # Update the sampler based on the current active points.
   1020         if since_update >= update_interval:
-> 1021             sampler.update(pointvol)
   1022             since_update = 0
   1023 

/lustre/home/wkerzend/miniconda3/envs/starkit/lib/python2.7/site-packages/nestle.pyc in update(self, pointvol)
    733     def update(self, pointvol):
    734         self.empty_queue()
--> 735         self.ells = bounding_ellipsoids(self.points, pointvol=pointvol)
    736         for ell in self.ells:
    737             ell.scale_to_vol(ell.vol * self.enlarge)

/lustre/home/wkerzend/miniconda3/envs/starkit/lib/python2.7/site-packages/nestle.pyc in bounding_ellipsoids(x, pointvol)
    509     ell = bounding_ellipsoid(x, pointvol=pointvol, minvol=True)
    510 
--> 511     return _bounding_ellipsoids(x, ell, pointvol=pointvol)
    512 
    513 

/lustre/home/wkerzend/miniconda3/envs/starkit/lib/python2.7/site-packages/nestle.pyc in _bounding_ellipsoids(x, ell, pointvol)
    450     # centroid = (2, ndim) ; label = (npoints,)
    451     # [Each entry in `label` is 0 or 1, corresponding to cluster number]
--> 452     centroid, label = kmeans2(x, k=start_ctrs, iter=10, minit='matrix')
    453 
    454     # Get points in each cluster.

/lustre/home/wkerzend/miniconda3/envs/starkit/lib/python2.7/site-packages/scipy/cluster/vq.pyc in kmeans2(data, k, iter, thresh, minit, missing, check_finite)
    638     for i in xrange(iter):
    639         # Compute the nearest neighbor for each obs using the current code book
--> 640         label = vq(data, code_book)[0]
    641         # Update the code book by computing centroids
    642         new_code_book, has_members = _vq.update_cluster_means(data, label, nc)

/lustre/home/wkerzend/miniconda3/envs/starkit/lib/python2.7/site-packages/scipy/cluster/vq.pyc in vq(obs, code_book, check_finite)
    202     """
    203     obs = _asarray_validated(obs, check_finite=check_finite)
--> 204     code_book = _asarray_validated(code_book, check_finite=check_finite)
    205     ct = np.common_type(obs, code_book)
    206 

/lustre/home/wkerzend/miniconda3/envs/starkit/lib/python2.7/site-packages/scipy/_lib/_util.pyc in _asarray_validated(a, check_finite, sparse_ok, objects_ok, mask_ok, as_inexact)
    236             raise ValueError('masked arrays are not supported')
    237     toarray = np.asarray_chkfinite if check_finite else np.asarray
--> 238     a = toarray(a)
    239     if not objects_ok:
    240         if a.dtype is np.dtype('O'):

/lustre/home/wkerzend/miniconda3/envs/starkit/lib/python2.7/site-packages/numpy/lib/function_base.pyc in asarray_chkfinite(a, dtype, order)
   1213     if a.dtype.char in typecodes['AllFloat'] and not np.isfinite(a).all():
   1214         raise ValueError(
-> 1215             "array must not contain infs or NaNs")
   1216     return a
   1217 

ValueError: array must not contain infs or NaNs

Happy to help fix. 

A simple example using the parallelization function

Dear developers,

I notice nestle can support parallelized sampling now. Can you give a simple example showing how to use it? I have made some attempts but failed. Below is my attempt.

import time

import nestle
from pathos.multiprocessing import ProcessingPool

pool = ProcessingPool(nodes=4)  # uses dill for pickling

#below class is from nestle source code
class FakePool(object):
    """A fake Pool for serial function evaluations."""

    def __init__(self,pool):
        self.pool = pool

    def submit(self, fn, *args, **kwargs):
        return FakeFuture(fn, *args, **kwargs)

    def map(self, func, *iterables):
        return self.pool.map(func, *iterables)

    def shutdown(self):
        pass
    
class FakeFuture(object):
    """A fake Future to mimic function calls."""

    def __init__(self, fn, *args, **kwargs):
        self.fn = fn
        self.args = args
        self.kwargs = kwargs

    def result(self):
        return self.fn(*self.args, **self.kwargs)

    def cancel(self):
        return True

fakepool = FakePool(pool)


t1 = time.time()
res = nestle.sample(loglike, prior, 7, method='multi',pool=fakepool, queue_size=4,
                    npoints=25,callback=nestle.print_progress)
t2 = time.time()

The above runs successfully, but the runtime does not decrease. Also, when I watch top in a Linux shell, it seems the method above does not spawn any subprocesses. Could you give any suggestions?
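The serial behaviour is expected: FakeFuture.result() runs the function in the calling process, so wrapping a real pool in FakePool never dispatches work through it. A sketch of what should actually parallelize (assuming nestle routes likelihood evaluations through pool.submit, as the FakePool interface suggests) is to pass a concurrent.futures.ProcessPoolExecutor directly, since it already provides submit(), map(), and shutdown():

import time
from concurrent.futures import ProcessPoolExecutor

import nestle

# loglike and prior are the same top-level functions as above; they must
# be picklable so the worker processes can run them.
with ProcessPoolExecutor(max_workers=4) as pool:
    t1 = time.time()
    res = nestle.sample(loglike, prior, 7, method='multi',
                        pool=pool, queue_size=4, npoints=25,
                        callback=nestle.print_progress)
    t2 = time.time()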

User-settable parameter for stopping criterion

The nested sampling algorithm stops when the weight of the samples has been declining for some given (but arbitrary) number of iterations. Declining weight means that the likelihood of new samples is increasing more slowly than the prior volume is decreasing, indicating that the likelihood is flattening out.
Currently this is set to it//6 iterations, where it is the current number of iterations.

While this seems reasonable, it would be nice to be able to tweak this parameter or think of a better parameterization for the stopping criterion. It may be better to stop when declining for a set number of iterations relative to number of points rather than number of iterations.
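For concreteness, a sketch of the current rule with the factor exposed as a parameter (illustrative, not nestle's exact code):

def iterations_until_stop(logwt_stream, decline_factor=6):
    # logwt_stream yields the log-weight of the point discarded at each
    # iteration; stop once the weight has declined for more than
    # it // decline_factor consecutive iterations.
    ndecl, logwt_old = 0, float("-inf")
    for it, logwt in enumerate(logwt_stream):
        ndecl = ndecl + 1 if logwt < logwt_old else 0
        logwt_old = logwt
        if ndecl > it // decline_factor:
            return it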

How to visualize intermediate steps in Nestle

We would like to "make a movie" of how the iterations arrive at the solution. We can of course run the program many times, but what we actually need is access to res.samples and res.weights (or something similar) during the run.
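One route that avoids modifying nestle is the callback, which is invoked once per iteration with a dict of progress information. A sketch (which keys the dict exposes is an assumption here, and the active points themselves may not be among them):

history = []

def record(info):
    # Snapshot the info dict each iteration for post-run animation.
    history.append(dict(info))

# res = nestle.sample(loglike, prior_transform, ndim, callback=record)
# Each entry in `history` then becomes one movie frame.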

Try refining clusters

In the multi-ellipsoidal algorithm, points are split into clusters recursively. At each split, K-means with K=2 is used to cluster the points.

In Feroz, Hobson & Bridges (2009), the algorithm further refines cluster membership after the K-means split so as to minimize the total volume of the two bounding ellipsoids. I have some code to do this in this gist. This would be plugged into the bounding_ellipsoids() function. However, I haven't put it in yet because it is unclear how necessary this is and whether it is worth the computational overhead.

To do: try the refinement procedure out, particularly on some higher-dimensional problems, and benchmark.
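For reference, a simplified sketch of such a refinement loop, using plain Mahalanobis distance rather than the volume-weighted metric of the paper. Here fit_ellipsoid stands in for nestle's bounding-ellipsoid fit and is assumed to return an object with a center ctr and inverse-covariance matrix a; a production version would also guard against a cluster emptying out:

import numpy as np

def refine_split(x, label, fit_ellipsoid, max_iter=10):
    for _ in range(max_iter):
        ells = [fit_ellipsoid(x[label == k]) for k in (0, 1)]
        # Squared Mahalanobis distance of every point to each ellipsoid.
        d = np.array([np.einsum('ij,jk,ik->i', x - e.ctr, e.a, x - e.ctr)
                      for e in ells])
        new_label = d.argmin(axis=0)
        if np.array_equal(new_label, label):
            break
        label = new_label
    return label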

Adaptive update interval

It can be expensive to determine bounding ellipsoids, particularly for the multi-ellipsoidal method. However, this doesn't need to be done every iteration. The update_interval parameter lets the user set how often the ellipsoidal decomposition is performed.

In theory, fewer updates mean less efficient sampling and more likelihood evaluations. Thus, the ideal choice for this parameter depends on the runtime of the likelihood function versus the runtime of the ellipsoidal decomposition. It should be possible to time the execution of the likelihood function while sampling in order to adaptively adjust the update interval.

However, this might depend on having a good understanding of the effect on the sampling efficiency, as measuring the sampling efficiency directly might give results too noisy to be useful.
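A sketch of the timing-based half of the idea (the function name and the 10% overhead target are illustrative):

def adaptive_update_interval(t_loglike, t_update, target_overhead=0.1):
    # With interval N, the decomposition costs t_update / N per iteration;
    # choose N so that this stays below target_overhead * t_loglike.
    return max(1, round(t_update / (target_overhead * t_loglike)))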
