icb-dcm / pyabc
distributed, likelihood-free inference
Home Page: https://pyabc.rtfd.io
License: BSD 3-Clause "New" or "Revised" License
Apparently, building the docs fails because including vector graphics in the LaTeX compilation is no longer supported on the platform.
Allow keeping everything on a log scale and only moving to linear scale at the last point, after summing, normalizing, etc.
This helps avoid numerical issues.
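A minimal sketch of the idea (the function names are hypothetical, not pyabc API): keep weights as log-values and only exponentiate after shifting by the maximum, so that weights which would underflow to zero on the linear scale stay well-defined.

```python
import numpy as np

def logsumexp(log_w):
    """Stable log(sum(exp(log_w))): shift by the maximum before exponentiating."""
    m = np.max(log_w)
    return m + np.log(np.sum(np.exp(log_w - m)))

def normalize_log_weights(log_w):
    """Normalize weights entirely on the log scale; exponentiate only at the end."""
    log_w = np.asarray(log_w, dtype=float)
    return np.exp(log_w - logsumexp(log_w))
```

A naive `np.exp(log_w) / np.exp(log_w).sum()` would return NaN for log-weights around -1000, while the shifted version gives the correct normalized weights.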
When working on the population, e.g. via update_distances() or to_dict(), make sure we do not run into difficulties from shared references. When population data are changed, this should therefore happen only on a copied population, which is then returned. The copying should go only as deep as components are actually changed.
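A sketch of the copy-as-deep-as-changed idea, with a toy `Population` class standing in for pyabc's actual one (names and structure are illustrative assumptions, not the real implementation):

```python
class Population:
    """Toy stand-in for a pyabc population: a list of particle dicts."""
    def __init__(self, particles):
        self.particles = particles

    def update_distances(self, new_distance):
        # Copy only what changes: new particle dicts with new distances,
        # while unchanged values (e.g. parameter dicts) stay shared references.
        particles = [dict(p, distance=new_distance(p["sumstat"]))
                     for p in self.particles]
        return Population(particles)
```

The original population is left untouched, while unchanged sub-objects are shared rather than deep-copied.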
Before being saved into the database, the weights of particles are normalized s.t. all particles belonging to one model have a summed weight of 1. However, there are multiple problems:
Solution:
w = w / sum(w)
Then the analyses should be correct. Maybe it would also make sense not to normalize the weights inserted into the database at all; then it would be easiest to always apply the correct normalization depending on what is needed.
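A small sketch of the per-model normalization described above (a standalone helper, not pyabc's actual storage code): rescale the weights so that, within each model, they sum to 1.

```python
import numpy as np

def normalize_per_model(model_ids, weights):
    """Rescale particle weights so weights within each model sum to 1."""
    model_ids = np.asarray(model_ids)
    weights = np.asarray(weights, dtype=float)
    out = np.empty_like(weights)
    for m in np.unique(model_ids):
        mask = model_ids == m
        out[mask] = weights[mask] / weights[mask].sum()
    return out
```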
In https://github.com/ICB-DCM/pyABC/blob/master/pyabc/sampler/redis_eps/sampler.py, the keys from https://github.com/ICB-DCM/pyABC/blob/master/pyabc/sampler/redis_eps/cmd.py are used. These are the same for ABC runs running in parallel, so they get mixed up; i.e., a server can currently be used for only one ABC run at a time. This problem has come up several times recently.
To solve this, one could prefix all keys in redis with a UUID (https://docs.python.org/3.7/library/uuid.html), either per ABC run or per population (thanks @neuralyzer for explaining).
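A minimal sketch of the UUID prefixing, assuming a thin wrapper around key construction (the wrapper class and the key names used below are hypothetical, not the actual redis_eps identifiers):

```python
import uuid

class PrefixedKeys:
    """Namespace redis keys per ABC run by prepending a run-specific UUID."""
    def __init__(self, run_id=None):
        # A fresh UUID per run keeps parallel runs on one server separate.
        self.run_id = run_id or uuid.uuid4().hex

    def key(self, name):
        return f"{self.run_id}:{name}"
```

Two independent runs then never collide, since each instance generates its own prefix.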
Running the visualization server via abc-server /tmp/test.db fails instantly (up-to-date master branch).
Error message:
Traceback (most recent call last):
  File "/home/yannik/anaconda3/bin/abc-server", line 11, in <module>
    load_entry_point('pyabc==0.8.20', 'console_scripts', 'abc-server')()
  File "/home/yannik/anaconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 570, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/home/yannik/anaconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2751, in load_entry_point
    return ep.load()
  File "/home/yannik/anaconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2405, in load
    return self.resolve()
  File "/home/yannik/anaconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2411, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/yannik/anaconda3/lib/python3.6/site-packages/pyabc/visserver/server.py", line 19, in <module>
    from bkcharts import Line  # noqa: E402
  File "/home/yannik/anaconda3/lib/python3.6/site-packages/bkcharts/__init__.py", line 17, in <module>
    from .builders.histogram_builder import Histogram
  File "/home/yannik/anaconda3/lib/python3.6/site-packages/bkcharts/builders/histogram_builder.py", line 27, in <module>
    from .bar_builder import BarBuilder
  File "/home/yannik/anaconda3/lib/python3.6/site-packages/bkcharts/builders/bar_builder.py", line 21, in <module>
    from bokeh.core.enums import Aggregation
ImportError: cannot import name 'Aggregation'
It looks like in smc._initialize_dist_and_eps (called in new()), a sample from the prior is always constructed to initialize distance and epsilon. To reduce computation overhead, it might be of interest to do this lazily, i.e. only when required by any of the components.
E.g. the adaptive distances are intended to render this pre-calibration unnecessary.
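One way to sketch the lazy evaluation (a generic helper, not pyabc code): wrap the expensive prior sampling in an object that only runs it on first access, so components that never request the calibration sample never trigger it.

```python
class LazyCalibrationSample:
    """Compute the expensive prior sample only on first access."""
    def __init__(self, sample_fn):
        self._sample_fn = sample_fn
        self._sample = None
        self.evaluated = False

    def get(self):
        # Sample at most once; subsequent calls return the cached result.
        if not self.evaluated:
            self._sample = self._sample_fn()
            self.evaluated = True
        return self._sample
```

Distance and epsilon would then receive the wrapper and call get() only if they actually need calibration data.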
At least the Epsilon adaptations do not work correctly when setting more than one simulation per proposed parameter. This feature is rarely used and currently not tested, so it is probably not a major issue, but we should still fix it of course.
Currently, e.g. the Transition and Acceptor are not saved to history, though this would be of interest when recovering past runs. Thus, one should add these to History during the initial database setup.
Attention: backwards compatibility. In particular, the abc-server uses this information.
Kind of similar to Epsilon and DistanceFunction, introduce a class Acceptor or similar, which encodes the acceptance step.
This is to allow for more complex acceptance rules than the simple comparison.
The best solution seems to be to create a new class Acceptor which is passed to ABCSMC and used in the accept() method, so that the user can easily override it.
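A minimal sketch of what such an Acceptor hierarchy could look like (class and method signatures here are illustrative assumptions, not the eventual pyabc API): the base class reproduces the current distance-vs-epsilon comparison, and users subclass it for richer rules.

```python
class Acceptor:
    """Encodes the acceptance step; the default reproduces distance <= eps."""
    def accept(self, distance, eps, x, x_0):
        return distance(x, x_0) <= eps

class ComponentwiseAcceptor(Acceptor):
    """Example override: require every summary-statistic component within eps."""
    def accept(self, distance, eps, x, x_0):
        return all(abs(xi - x0i) <= eps for xi, x0i in zip(x, x_0))
```

ABCSMC would then simply call acceptor.accept(...) instead of hard-coding the comparison.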
The default kernel density estimators apparently sometimes over-smoothen. In particular, the MultivariateNormalTransition used for visualization is unsuited for multi-modal landscapes (see attached picture). In this case, the problem could be solved by reducing the scaling factor.
At least, one should allow the user to specify a KDE here.
Simplify / change the quickstart example to be as short as possible. Show and explain only the necessary basic lines of code that are always needed. Do not use model selection here (move it to a separate example), since this is probably not the most common application, but rather parameter inference.
As done already for pypesto, add codecov and codacy to keep track of changes in the code quality and test coverage.
Also, show badges on the README.md page.
The following warning occurs when running pyabc:
FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
in:
pyabc/storage/history.py:200
pyabc/transition/multivariatenormal.py:64
pyabc/smc.py:729
among others.
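The fix is mechanical: pandas deprecated DataFrame.as_matrix() in favor of the .values attribute, which is a drop-in replacement for these call sites.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# deprecated, emits FutureWarning (removed in later pandas versions):
#   arr = df.as_matrix()
arr = df.values  # drop-in replacement returning the same numpy array
```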
The compilation of the documentation on readthedocs fails. This seems to be due to some name confusion between flask and flask-bootstrap: when searching for flask, flask-bootstrap is found as the closest match, which does not make sense. We have so far not been able to reproduce this error.
In storing and loading dataframes and numpy arrays, there are some work-around functions. In particular, I wonder what the line https://github.com/ICB-DCM/pyABC/blob/master/pyabc/storage/numpy_bytes_storage.py#L50 is supposed to do, and whether it could fail in the special case of size-1 numpy arrays.
Generalize the MedianEpsilon to allow arbitrary quantiles, with both weighted and non-weighted distances.
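A sketch of such a generalization (a standalone helper under assumed semantics, not the MedianEpsilon implementation): an arbitrary-quantile function that reduces to the median at alpha = 0.5 and supports particle weights on the distances.

```python
import numpy as np

def weighted_quantile(values, alpha, weights=None):
    """Quantile of `values` at level alpha in (0, 1], with optional weights."""
    values = np.asarray(values, dtype=float)
    if weights is None:
        weights = np.ones_like(values)
    weights = np.asarray(weights, dtype=float)
    # Sort values and accumulate normalized weights, then pick the first
    # value whose cumulative weight reaches alpha.
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights) / np.sum(weights)
    return values[np.searchsorted(cum, alpha)]
```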
In order to automatically make sure all tests are still working, it would be good to run all as part of the travis tests (maybe except ion_channels which might be a bit difficult to set up).
Pre-defined / updated as max. over all previous / guessed + updated?
Location: Comparator?
This feature is, to my knowledge, nowhere really used. It causes trouble with weighting particles vs. summary statistics normalized to 1, and if multiple samples per parameter are desired, this is very easy to implement on the user side in a much more flexible way.
If there are no objections, this feature will thus be discontinued.
This should be possible without changing the database format, though that is not completely clear yet.
The seaborn.PairGrid, as used in pyabc.visualization.plot_kde_matrix, seems to have changed its API. Namely, one runs into TypeError: cannot concatenate object of type "<class 'numpy.ndarray'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid, triggered by pyabc/visualization.py line 310, df = pd.concat((x,), axis=1). The reason seems to be that seaborn since version v0.9.0 (July 2018) converts the data to a numpy array before passing it on to the plotting function (https://github.com/mwaskom/seaborn/blob/master/seaborn/axisgrid.py#L1390).
Suggestion: We need the column name, so check whether it is passed somewhere else, or just drop seaborn.
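One possible workaround, assuming the column name is still available to the plotting callback from another argument: re-wrap the numpy array seaborn hands over into a named pandas Series before concatenating (the helper name is hypothetical).

```python
import numpy as np
import pandas as pd

def as_named_series(x, name):
    """Re-attach the column name seaborn drops when converting to numpy."""
    if isinstance(x, np.ndarray):
        return pd.Series(x, name=name)
    return x
```

With the name restored, the existing pd.concat((x,), axis=1) call works again.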
To improve readability when running the provided jupyter notebook examples, not only when viewing them on readthedocs.io, move especially all mathematical expressions to markdown text instead of reST, because the latter cannot be interpreted by jupyter.
Links to classes etc. will need to remain reST to allow relative linking.
Set up a new branching and commit system so that developers always merge into develop (which is also monitored on travis, rtd, ...), and only fully-fledged stable versions are merged into master, getting a new version number. See also https://nvie.com/posts/a-successful-git-branching-model/.
Also, for every commit to master, automatically (via travis) upload the new version to pypi and create a tag (https://git-scm.com/book/de/v1/Git-Grundlagen-Tags).
Make samplers process and return only required summary statistics, in order to avoid communication overhead if distance functions only need the accepted summary statistics, but not all sampled summary statistics.
Create a new way of passing additional information to samplers (e.g. as by now, to record also non-accepted particles or a strategy of which particles to record), generically. Also, reduce code duplication in the different samplers.
Apart from the visualization, it would be good to also offer some functions that compute percentiles, a-posteriori expectations and related things, to summarize the a-posteriori distribution encoded in the sample population.
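A sketch of one such summary function (a standalone helper under assumed semantics, not pyabc API): the weighted a-posteriori expectation of an arbitrary function of the parameter, computed from an accepted, weighted population.

```python
import numpy as np

def posterior_expectation(samples, weights, f=lambda x: x):
    """Weighted a-posteriori expectation E[f(theta)] from a particle population."""
    samples = np.asarray(samples, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize, in case weights are raw
    return float(np.sum(weights * f(samples)))
```

With f = identity this gives the posterior mean; other choices of f yield e.g. second moments for variance estimates.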
For the diagonal, the y axis label is wrong because it is taken as the range of the y parameter.
The rtd build is at the limit of the allowed 15 min time frame. This is mainly because many dependencies like r-base are installed. I think these are not necessary if we run all notebooks in advance, so that nbsphinx does not need to do that anymore.
After a sample round has finished, check whether any assumption was violated (e.g. a distance found that is above the previously assumed maximum distance), re-evaluate the population, and possibly run some more samples.
Iterate this until sufficiently many samples are accepted.
Allow to (probably from the samples) compute credible intervals for the parameters in the approximate posterior distribution.
Sometimes the redis server reportedly does not get notified when one of its workers is killed, and then waits forever for this worker to come back.
When exactly this happens has not been reproduced yet.
Implement comfort functions that allow logging to console or file, e.g. log_to_console(level=logging.DEBUG). Also, use __name__ in logging.getLogger() in the various loggers, to make it easier to trace messages back to their respective modules, and have a pyabc super-logger.
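A sketch of what such a comfort function could look like (the function is hypothetical, built only on the standard logging module): attach a console handler to the "pyabc" super-logger, to which all module loggers created via logging.getLogger(__name__) automatically propagate.

```python
import logging

def log_to_console(level=logging.DEBUG):
    """Hypothetical comfort function: route all pyabc.* logs to the console."""
    logger = logging.getLogger("pyabc")  # super-logger for the package
    logger.setLevel(level)
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(name)s %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    return logger
```

Because module loggers like logging.getLogger("pyabc.transition") are children of "pyabc", their records roll up to this one handler, and the %(name)s field identifies the originating module.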
Both on github and in the documentation on readthedocs, the correct paper for pyabc in its final version should be linked, as well as a BibTeX entry.
Implement the possibility to have distances adapt weights according to the data, based on [Prangle. Adapting the ABC distance function. 2015] and adapted for pyabc.
It is not yet very clear from the documentation how to handle anaconda.
Make more explicit how to set the paths, and explain how to use source activate.
From a theoretical point of view, convergence of the approximate posterior to the real posterior can be proven provided bounded eccentricity of the acceptance regions. Transferred to the adaptive distances, this implies that the ratio of maximum and minimum weight should be bounded in order to obtain the theoretical confirmation that the convergence is correct.
In practice, this seems to be of minor importance, yet it might be good to add a field to the implemented adaptive distance functions to allow the user to set a maximum eccentricity.
I tried to run the using_R example and got the error that the package external is not found.
I came up with a solution, but I do not yet feel confident enough with git to try to fix it.
This can be easily fixed by adding the following line in __init__.py:
from .external import R
and then including the entry "External" in the __all__ array.
This makes the example run smoothly.
In addition to #48, since recently the readthedocs build does not start at all. It stops at conda install ... using the environment.yml file, without any output.
Possible problem: the R dependencies cannot be installed. Do we really need those on rtd?
In Epsilon.__call__(), always pass a valid population (maybe taken from history?). The SMC class should always make sure it provides the correct population. A problem to be checked is whether this works reliably when the ABC run is resumed.
For predicting the acceptance rate, use weighted distances instead of unweighted ones.
When exporting, e.g. via abc-export --db results.db --out exported.feather --format feather, sometimes the result table does not include all columns, e.g. not the parameter value ones. This seems to happen when exporting only a single generation (everything is fine with --generation=all), because then the tidy functionality in https://github.com/ICB-DCM/pyABC/blob/master/pyabc/storage/history.py#L826 apparently messes things up a bit.
Make the epsmixin sampler fully object-oriented.
Also, there seem to be race conditions occurring sometimes and leading to errors, the reasons for which have not been discovered yet.
We ran two cases of the same model, once with 1000-dimensional summary statistics, once with 6-dimensional ones. The number of samples was almost the same for both cases (200-400 for 100 acceptances), but the first one took several times as long. When we tried accessing the database during the run, this was not possible because it was locked: pyabc was apparently writing into it over a long time. This indicates that writing high-dimensional summary statistics to the database takes non-negligible time.
Possible solutions would be a flag to not store summary statistics, or moving the file system operations to another thread (if available), so that pyabc can continue meanwhile. Maybe there is also a more efficient data format, but the SQL database is already quite nice and readable, so I would not want to change that.
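The second suggestion can be sketched generically (a standalone helper, not pyabc code): a background thread drains a queue of write jobs, so that submitting a write returns immediately and sampling can continue while the storage operation runs.

```python
import queue
import threading

class AsyncWriter:
    """Offload blocking storage writes to a background thread via a queue."""
    def __init__(self, write_fn):
        self._queue = queue.Queue()
        self._write_fn = write_fn
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: stop the writer
                break
            self._write_fn(item)

    def submit(self, item):
        self._queue.put(item)  # returns immediately

    def close(self):
        # Flush remaining writes, then stop the thread.
        self._queue.put(None)
        self._thread.join()
```

Note that SQLite's locking still applies per write; this only decouples the sampler from the write latency, it does not shorten the writes themselves.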
import pyabc
fails when git is not installed, due to an error in the gitpython package.
Steps to reproduce:
import pyabc
In the notebook, there are some minor inaccuracies: the Y component is observed, not X.
At some point (not now) it might be interesting to adapt Algorithm 5 in [Prangle. Adapting the ABC distance function. 2015], since it gives an even improved guess of the distance weights. However, it is not so easily integrable with the pyABC framework (and might lead to increased sampling times when the weights are rather homogeneous?).
Small error in the documentation on this page: http://pyabc.readthedocs.io/en/latest/what.html
It reads:
What you don't need
the likelihood function: p(parameter|data) is not required.
Shouldn't it rather be:
the likelihood function: p(data|parameter) is not required.
P.S. I really like this library! Just what I needed and very easy to use so far.
When an ABCSMC run is continued (e.g. to run a few more populations because the estimate is not yet satisfying), the smc.load() method is called. Here, not all parameters adjusted during the ABC run can be re-initialized (e.g. distance weights, epsilon, etc.). It must be checked in detail what is done here, and e.g. the method adapted to perform some history-based initialization (using the last population instead of the prior predictive distribution).
But it works well enough so far.