stan-dev / cmdstanpy

CmdStanPy is a lightweight interface to Stan for Python users which provides the necessary objects and functions to compile a Stan program and fit the model to data using CmdStan.

License: BSD 3-Clause "New" or "Revised" License

Python 78.06% Jupyter Notebook 18.37% R 1.58% Stan 1.67% Makefile 0.11% Batchfile 0.10% HTML 0.08% C++ 0.03%

cmdstanpy's Issues

Use black to format code

Summary:

Formatting code with black makes it cleaner.

https://github.com/ambv/black

Description:

Automatic formatting with Black reduces developer hours and keeps the codebase clean.

Additional Information:

Developer environment needs to be Python 3.6, but the formatted code is still supported by 2.7

windows install script or instructions for CmdStan

Summary:

CmdStanPy requires a local install of CmdStan. The environment variable CMDSTAN is used to specify the path to the CmdStan install. If this variable isn't set, CmdStanPy uses default directory location ~/.cmdstanpy. The script make_cmdstan.sh downloads and compiles the latest release of CmdStan from GitHub. By default it installs the latest version of CmdStan in the default directory. Flags -d and -v are used to specify the directory and version, respectively.

Modify script make_cmdstan.sh as needed for Windows or create a Windows-appropriate install script or installer for CmdStan.

Additional Information:

Current Version:

add logic to `sample` function to handle dense HMC metric

Summary:

The sample function should allow the HMC sampler argument metric, which takes one of two values: diag_e (diagonal Euclidean) or dense_e (dense Euclidean).

Description:

Add metric attribute to SamplerArgs object - update functions validate and compose_command accordingly
Add metric argument to sample command

Note: if metric is dense and metric_file is non-null, then the specified mass_matrix must be a matrix; otherwise, if metric_file is non-null, the specified mass_matrix must be a vector (i.e., just the diagonal of the mass_matrix).

Additional Information:


Current Version:

Add method `variational`

Summary:

Add methods to run the CmdStan variational method and to retrieve estimates from the resulting Stan csv files.

Description:

To add this functionality we need to:

add method variational to Model class
add new class VariationalArgs in file cmdstan_args.py and allow this type to be one of the types allowed for method attributes of a CmdStanPy object.
add new class StanADVI which holds resulting inference
add methods to parse this information out of the Stan csv file
unit tests for all of the above
ideally, Jupyter notebook showing how to use this - input data, model, run, output, etc.

Additional Information:

See issue #58

Current Version:

refactor - remove analysis of output from RunSet object

Summary:

Refactor the RunSet object so that it only runs the sampler and records the names of the output.csv files. Add a standalone function to call CmdStan's stansummary method instead.

Description:

Currently the RunSet object runs one or more chains and analyzes the output.csv files via the function stansummary_csv which populates the RunSet properties summary attribute.
Simplify this object removing the summary attribute and all properties which access it.

Additional Information:

See the functional spec https://github.com/stan-dev/design-docs/blob/master/designs/0002-cmdstanpy_func_spec.md for full details

Current Version:

PyPi package?

Summary:

Would like to install this from pip if possible!

More verbose code

Summary:

What do you think about making the code a bit more verbose?

Right now there are many places where I'm not totally certain what some values should be:

# this is example
_, k, v = p
c, v = ... 

_, header, tail = p

or some things could be even more verbose. I'm not saying that we fill the code with comments, but just have descriptive variable names (in some places).

function `sample` argument `inits` should allow initialization function

Summary:

The sample command inits is used to specify some or all initial parameter values.

Add functionality as in RStan:

Set initial values by providing a function that returns a ~~list~~ Dict for specifying the initial values of parameters for a chain. The function can take an optional parameter chain_id through which the chain_id (if specified) or the integers from 1 to chains will be supplied to the function for generating initial values.
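As a rough illustration of the request above, an initialization function might look like the following. This is only a sketch: the names `make_inits` and `inits_per_chain` and the parameter name `theta` are hypothetical, and nothing here is part of the CmdStanPy API.

```python
import random


def make_inits(chain_id=None):
    """Hypothetical init function: returns a Dict of initial parameter values.

    Seeding with chain_id makes each chain's initial values reproducible.
    """
    rng = random.Random(chain_id)
    # "theta" is an illustrative parameter name, not taken from a real model.
    return {"theta": rng.uniform(0.1, 0.9)}


def inits_per_chain(init_fn, chains):
    """Call init_fn once per chain, supplying chain ids 1..chains."""
    return [init_fn(chain_id) for chain_id in range(1, chains + 1)]


all_inits = inits_per_chain(make_inits, chains=4)
```

Each returned Dict could then be written to a per-chain file in the same way as a user-supplied Dict of initial values.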

Main purpose of the library

Hi, what should be the main purpose of this library?

Is it just a helper library to call cmdstan with a pythonic interface? How important is it to read output csv and handle csv information?

How much do you want to add extra functionality on top of that?

I know there is the wiki/docs/specs somewhere, but I'm just asking for your general opinion.

I made some points, some are probably wrong:

[+]  0. Automatic setup of the CmdStan?
[+]  1. Text to textfile
[+]  2. Translate stan --> hpp with stanc
[+]  3. Compile hpp
[/]  4. Data from python -> text file
[+]  5. sample/vb/etc
[ ]  6. Read output?
[+]  7. Call stansummary
[+]  8. Call/check diagnostics
[-]  9. plotting
[-] 10. analyze results

Meaning:

[+] this lib
[/] maybe this lib?
[ ] maybe or maybe not?
[-] not this lib

update docs

Summary:

The current contents of the docs dir haven't been updated. Do this immediately!

Allow use of Dockerized CmdStan

Summary:

CmdStan on Windows is (still) a pain, see here for example, but generally toolchains are a "system" thing; testing different versions would also be easier if Dockerized.

CmdStanPy invokes CmdStan executables via the command line, and it should be straightforward to swap out a direct invocation for an invocation via Docker: compilation & running go from

make -C $CMDSTAN_PATH O=3 path/to/bernoulli
path/to/bernoulli sample $args

to

docker run stan/cmdstan make -C $CMDSTAN_PATH O=3 path/to/bernoulli
docker run stan/cmdstan path/to/bernoulli sample $args

There's some path mapping to think about but it's certainly easier than trying to compile CmdStan with MSVC.

Most of it should be transparent to the user who would specify CMDSTAN_PATH=docker or similar.
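The swap described in this issue could be sketched as a small helper that prefixes a command with `docker run` when the configured path is the sentinel value `docker`. The function name `wrap_for_docker` is hypothetical, the image name `stan/cmdstan` is taken from the commands in the issue, and path mapping (volume mounts) is deliberately left out.

```python
def wrap_for_docker(cmd, cmdstan_path, image="stan/cmdstan"):
    """Return the argument list to execute, optionally routed through Docker.

    Hypothetical helper: when cmdstan_path is the sentinel "docker", the
    command is prefixed with `docker run <image>`; otherwise it is returned
    unchanged. Volume mounts / path mapping are not handled here.
    """
    if cmdstan_path == "docker":
        return ["docker", "run", image] + list(cmd)
    return list(cmd)


# Build (but do not run) the sampling command from the issue text.
argv = wrap_for_docker(["path/to/bernoulli", "sample"], cmdstan_path="docker")
```

The resulting list can be passed to `subprocess.run` in the same way as a direct invocation, which is why most of the change should stay invisible to the user.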

add method `Fixed_Param`

Summary:

Implement logic for algorithm=Fixed_Param

Description:

from the CmdStan manual:

Fixed Parameter Sampler
The fixed parameter sampler generates a new sample without changing the current state of the Markov chain; only generated quantities may change. This can be useful when, for example, trying to generate pseudo-data using the generated quantities block. If the parameters block is empty (no parameters) then using algorithm=fixed_param is mandatory.

This generates the same output as the sample command.

~~- add method fixed_param to Model class~~
~~- implement placeholder class FixedParamArgs in file cmdstan_args.py as needed.~~

unit tests
jupyter notebook

Update: "algorithm=fixed_param" is part of method sample - added boolean arg fixed_param, default is false. don't need additional methods or class FixedParamArgs.

Additional Information:

"Fixed_Param" is a terrible name. alternatives: "run_program". Need to get across idea that sample is not generated by MCMC, instead using RNG functions to generate outputs.

Current Version:

Setup CI/CD

Summary:

How to do CI for cmdstanpy

Description:

Previous development used a Docker build to run tests & push tags as packages to PyPI, the question here is if that is to be retained and how (build on Travis, etc).

Additional Information:

In repo, cf Dockerfile & .gitlab-ci.yml

Current Version:

N/A

CMDSTAN now CMDSTAN_PATH

Summary:

Looks like now cmdstanpy uses the CMDSTAN_PATH env variable instead of the CMDSTAN env variable to track the path of cmdstan. I think this just needs the docs and possibly tutorial ipython notebook to be updated.

[edit] actually apparently you can't set the cmdstan location anymore! I would personally really like to be able to set it...

[edit 2] I lied, you can set it:

import cmdstanpy
cmdstanpy.CMDSTAN_PATH = "/path/to/cmdstan/"

RunSet's repr throws an error

Summary:

RunSet's __repr__ is broken.

Description:

Followed the instructions in the notebook.

import os
import os.path
from cmdstanpy import cmdstan_path, compile_model, sample, get_drawset, summary, diagnose

bernoulli_path = os.path.join(cmdstan_path(), 'examples', 'bernoulli')
bernoulli_stan = os.path.join(bernoulli_path, 'bernoulli.stan')
bernoulli_model = compile_model(bernoulli_stan)
bern_data = { "N" : 10, "y" : [0,1,0,0,0,0,0,0,0,1] }

bern_fit = sample(bernoulli_model, data=bern_data)

bern_fit  # Error! (displaying the object calls __repr__)

Additional Information:

Offending line is here: https://github.com/daikonradish/cmdstanpy/blob/master/cmdstanpy/lib.py#L500

(Should be self._args)

Current Version:

implement wrapper function to cmdstan `bin/stansummary`

Summary:

Implement the function described in the CmdStanPy Functional Spec:

https://github.com/stan-dev/design-docs/blob/master/designs/0002-cmdstanpy_func_spec.md#summary
summary(runset = sampler_runset, output_file= "filename")

save PosteriorSample csvfiles to permanent location

Summary:

The default location for the sampler csv files is under /tmp. Add functionality to PosteriorSample object which allows user to move them into a user-specified directory.

Description:

Add function to PosteriorSample object - argument dirname (or similar) that saves the CSV files to a specified directory. Add appropriate checks and error handling.
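A minimal sketch of the requested behavior, written as a standalone function rather than a PosteriorSample method. The name `save_csvfiles` and the specific checks are assumptions chosen for illustration.

```python
import os
import shutil


def save_csvfiles(csv_files, dirname):
    """Move sampler CSV files into dirname and return their new paths.

    Hypothetical helper: refuses to proceed if the target directory is
    missing, a source file is missing, or a file of the same name already
    exists in the target directory.
    """
    if not os.path.isdir(dirname):
        raise ValueError("not a directory: {}".format(dirname))
    saved = []
    for src in csv_files:
        if not os.path.isfile(src):
            raise ValueError("missing csv file: {}".format(src))
        dest = os.path.join(dirname, os.path.basename(src))
        if os.path.exists(dest):
            raise ValueError("file already exists: {}".format(dest))
        shutil.move(src, dest)
        saved.append(dest)
    return saved
```

Because the files are moved rather than copied, the object holding the original paths would also need to record the returned paths.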

investigate whether or not `make` works on all supported platforms.

Summary:

On some systems make is not called make.
Investigate ways for users to set up their environment to deal with this.

Description:

see discussion here: #45 (comment)

Python 2 support

Summary:

cmdstanpy does not support Python 2

Description:

Originally written for Python 3, cmdstanpy uses type annotations among other features of Python 3 that are not supported in Python 2, despite Python 2 still being widely used (official EOL in 2020).

Additional Information:

It would not be too much effort to support Python 2 by removing use of incompatible features, if there's demand.

Current Version:

v0.0

Posix style path to make command

Summary:

Calling CmdStan make needs posix style paths for its arguments.

To fix this, for all commands calling make, do:

path.replace("\\", "/")
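For illustration, the replacement above can be wrapped in a helper, with the standard library's `pathlib.PureWindowsPath.as_posix()` shown as an alternative. The helper names are hypothetical.

```python
from pathlib import PureWindowsPath


def to_posix(path):
    """Convert backslash separators to forward slashes, as suggested above."""
    return path.replace("\\", "/")


def to_posix_pathlib(path):
    """Same conversion via pathlib, treating the input as a Windows path."""
    return PureWindowsPath(path).as_posix()


converted = to_posix("C:\\Users\\me\\bernoulli")
```

The pathlib variant also normalizes the path as a side effect, so the plain string replacement is the closer match to the fix proposed in the issue.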

`sample` function should allow CmdStan `engaged` argument

Summary:

CmdStan's HMC sampler has an argument engaged which controls whether or not the sampler performs adaptation. When engaged=0 the sampler skips adaptation altogether and uses whatever values the stepsize and mass_matrix are set to.

Description:

Add engaged attribute to SamplerArgs object - update functions validate and compose_command accordingly
Add engaged argument to sample command

Additional Information:

Extensive discussion of proper use of options engaged/ num_warmups here: stan-dev/cmdstan#604

Current Version:

Address any Arviz integration issues

Summary:

ArviZ (pronounced "AR-vees") is a Python package for exploratory analysis of Bayesian models. Includes functions for posterior analysis, model checking, comparison and diagnostics.

From #14: the previous pycmdstan included some plotting code, which it was agreed would be better replaced by a downstream library in a typical workflow; ArviZ came up. It would be nice to

address any integration issues
include in a short guide showing how easy it is

refactor - conflate RunSet, PosteriorSample

Summary:

There's too much overlap between RunSet and PosteriorSample objects - investigate how to refactor.

Description:

The sample command instantiates a RunSet object which keeps track of the number of chains, config, return code and output files. If all chains run to completion successfully and return the same number of draws, then the sample command instantiates and returns a PosteriorSample object.

The essential feature of the PosteriorSample object is that it lazily instantiates the in-memory representation of the sample from the stan_csv output files, but this logic could just as easily be done by a RunSet.

PosteriorSample functions summary and diagnose should be first-class functions which take a RunSet argument.

Additional Information:

Initial versions of the functional spec had this design.

Current Version:

Fix doc references to PyCmdStan

Summary:

There are references (e.g. http://cmdstanpy.readthedocs.org) to pycmdstan we should change to cmdstanpy. (we might even want to change the Python module to just cmdstan or stan).

expose CmdStan methods `optimize`, `variational`, `generate_quantities`, sampling with `fixed_param`

Summary:

The beta version of CmdStanPy only has function sample which runs the NUTS-HMC sampler. Expose the following additional CmdStan methods:

optimize
variational
generate_quantities
sampler algorithm fixed_param which is used for Stan programs without any parameters.

This is an umbrella issue. We need per-method issues to break this down into manageable chunks. This issue is intended to be used to work out the best possible factorization of the CmdStanPy code base, as well as the best function and argument names for cross-interface compatibility.

Description:

The CmdStan CLI has common arguments and method and algorithm-specific arguments.
As currently implemented, CmdStanPy has a helper class SamplerArgs which is a composite of the data, output, and NUTS sampling algorithm arguments. Keep as a single composite, or refactor into sub-components?

Each method writes its output to the specified output file in a method-specific format - this requires writing additional parsing routines, and adding methods to the StanFit (fka RunSet) object to return this information using the appropriate Python object.

Additional Information:

The CmdStan argument names don't line up with RStan/PyStan names:

sample -> sampling
optimize -> optimizing
variational -> vb

Currently only CmdStan exposes the method stan::services::generate_quantities. Should this be called generated_quantities? This seems like the better name, in which case we should change the CmdStan CLI arguments as well.

Current Version:

C++ toolchain - install instructions or lightweight download?

Summary:

CmdStanPy provides script install_cmdstan but that depends on already having a C++11 (or higher?) compiler installed. Investigate how to automate toolchain install and/or provide platform-specific foolproof step-by-step instructions

Description:

Installation has always been a huge pain point for non-programmers, which is the category that many working statisticians and data analysts fall into. For them, we need a painless install experience.

Current Version:

`chains=` argument to `sample` doesn't seem to run chains in parallel

Summary:

I'm running sample(model, data=dict(...), chains=4) and it just starts 1 chain at a time.

Current Version:

develop as of 6cfbc99
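One way chains could be launched concurrently is sketched below, assuming each chain corresponds to a separate command line. The function name `run_chains` is hypothetical and the commands in the usage example are stand-ins, not actual CmdStan invocations.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_chains(commands, max_workers=None):
    """Run one subprocess per chain concurrently.

    Returns the return codes in the same order as the commands. Threads are
    enough here because each worker only waits on an external process.
    """
    if not commands:
        return []

    def run_one(cmd):
        return subprocess.run(cmd, capture_output=True, text=True).returncode

    with ThreadPoolExecutor(max_workers=max_workers or len(commands)) as pool:
        return list(pool.map(run_one, commands))


# Stand-in commands: four trivial Python processes instead of model executables.
demo_commands = [[sys.executable, "-c", "print({})".format(i)] for i in range(1, 5)]
return_codes = run_chains(demo_commands)
```

Setting `max_workers` lower than the number of chains would cap how many run at once, which matters on machines with fewer cores than chains.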

Add functionality to download compiler toolchain

It would be awesome to provide a scripted shortcut in the cmdstanpy library that either helps a user install a compiler toolchain or downloads one for them to some specified directory that we then use. Ideally the latter so that we could help users without administrative privileges (especially common on Windows).

R has a package that attempts to perform the global version: https://github.com/r-lib/pkgbuild. I wonder if something similar exists in the Python world that we can adapt.

implement `run_generated_quantities` method

Summary:

Add methods to run the CmdStan generate_quantities method and to retrieve estimates from the resulting Stan csv files.

Description:

The generated quantities block is used to define values that depend on parameters and data - "quantities of interest" or QOIs. This includes predictive inferences as well as forward simulation for posterior predictive checks. The standalone generated quantities method allows users to ask additional questions of a fitted model by taking an existing sample given a model and data, and for each draw in the sample, using the fitted parameter values for that draw to run just the standalone generated quantities block.

The inputs to the standalone generated quantities method are:

a sample generated from a given model and dataset (as a Stan csv file)
a new version of that model which has the same data and parameters but which defines new/different variables in the generated quantities block
the data used to fit the model

To add this functionality we need to:

add method run_generated_quantities to Model class
add new class GenerateQuantitiesArgs in file cmdstan_args.py and allow this type to be one of the types allowed for method attributes of a CmdStanPy object.
unit tests for all of the above
ideally, Jupyter notebook showing how to use this - input data, model, run, output, etc.

If possible, the method name should be both concise and descriptive - run_generated_quantities seems too long and run_gqs seems too short.

Additional Information:

The output csv file is essentially the same as the output from the sample command, but minus sampler state information. should the result be exposed as property sample or should there be a different property generated_quantities ?

Current Version:

Support ujson if available

Summary:

If ujson (drop-in replacement for stdlib's json module) is available, use that instead, as it's several orders of magnitude faster. I think this is a 4 line change:

try:
    import ujson as json
except ImportError: 
    import json

That should get us within @maedoc's 5% of fit time quite easily.

==================
WAS: request for faster data transfer
Ideas:

Unix sockets (or in general allowing users to pass in the file handle for whatever they want to use to do communication between CmdStanPy and CmdStan). This should be a pretty quick fix to the code, just allowing for an additional argument.
Something faster to parse and create than JSON (I know this would also require changes in CmdStan). This post compares a few options: https://yuhui-lin.github.io/blog/2017/08/01/serialization

Use pytest

Summary:

Let's upgrade our tests to pytest.

Description:

Use pytest for tests.

Additional Information:

This will enable us to do clever things.

refactor code base - rename files, classes, methods

Summary:

Change names to match functional spec - see https://github.com/stan-dev/design-docs/blob/master/designs/0002-cmdstanpy_func_spec.md

Description:

The CmdStanPy is copied from base https://github.com/maedoc/pycmdstan, as it is very close to what we want in a lightweight wrapper. As the first step of the refactor, rename directories, files, classes and methods according to the spec.

Current Version:

Use logging module instead of print

Summary:

Currently we log our messages via print calls, which makes it impossible to control the output.

Description:

Logging should be done using import logging (the same way as in fbprophet)
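A minimal sketch of what this could look like with the standard library's logging module. The helper name `get_logger`, the logger name, and the message format are assumptions.

```python
import logging


def get_logger(name="cmdstanpy"):
    """Return a named logger with a single stream handler attached.

    The handler is added only once, so repeated calls do not duplicate
    output.
    """
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(levelname)s:%(name)s:%(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


logger = get_logger("cmdstanpy_demo")
logger.info("compiling model")  # replaces a bare print(...) call
```

Users can then silence or redirect messages with `logging.getLogger("cmdstanpy_demo").setLevel(logging.WARNING)` instead of editing library code.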

add logic to csv parsing to handle saved warmup iterations

Summary:

If save_warmup is True, sampler writes warmup iterations to stan_csv file.
The current parsing logic doesn't account for this. This is a bug.

People like to use save_warmup to try to diagnose model problems and get an idea of how many warmup iterations are necessary. Determine whether or not this is good practice. If it is, spec out the functions (if any) that should be added to CmdStanPy to allow users to do this. If not, document other ways to do this in CmdStanPy . This is a feature.

Description:

Warmup iterations come directly after csv header line, before the comment lines which report adaptation termination, step size and metric.
Add logic to parser routines to check for rows following header but before adaptation section.
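The check described above could be sketched as follows. It assumes the adaptation section begins with a comment line containing "Adaptation terminated"; the function name `split_draws` is hypothetical and the input in the usage example is made up.

```python
def split_draws(lines):
    """Split Stan CSV data rows into (header, warmup_rows, sampling_rows).

    Rows between the header and the adaptation comment are treated as saved
    warmup iterations; rows after it are sampling iterations. If no
    adaptation comment is found, all rows are treated as sampling draws.
    """
    header = None
    seen_adaptation = False
    before, after = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            if header is not None and "Adaptation terminated" in line:
                seen_adaptation = True
            continue
        if header is None:
            header = line.split(",")
            continue
        row = [float(x) for x in line.split(",")]
        (after if seen_adaptation else before).append(row)
    if not seen_adaptation:
        return header, [], before
    return header, before, after


example = [
    "# model = bernoulli",
    "lp__,theta",
    "-7.0,0.3",
    "-6.9,0.25",
    "# Adaptation terminated",
    "# Step size = 0.9",
    "-6.8,0.2",
]
header, warmup, sampling = split_draws(example)
```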

Additional Information:

In the long run, better output handling by core Stan should make the csv files go away. Until then, we make do.

Current Version:

jupyter notebooks for common use cases

Summary:

Write jupyter notebooks for common use cases - flavors of data, estimation methods, common models.

We especially need a notebook or two showing how to use other packages to do visualizations and analysis of either the StanFit object sample attribute (3D numpy ndarray, draws X chains X samples, stored column-major, i.e. Fortran style) or the pandas DataFrame returned by the StanFit object's get_drawset method.

Description:

Along with the readthedocs docs, create a series of jupyter notebooks with common use cases. Organize this in a notebooks folder under top-level directory docs, or whatever is standard practice for Python packages.

Additional Information:

There is currently a top-level notebook https://github.com/stan-dev/cmdstanpy/blob/master/cmdstanpy_tutorial.ipynb - this should be moved to the notebooks folder.

Current Version:

API changes - rename `RunSet`, make first-class functions into class methods

Summary:

All interfaces should use similar if not identical names for functions and objects as much as possible so that we can leverage documentation, teaching materials, etc.

In particular, the name RunSet is not a good innovation. Change to StanFit.

In anticipation of a lightweight R wrapper, make first-class functions compile_model and sample into Model object class methods compile and sample.

Description:

Change the API and code organization. Split file lib.py into modules Model and StanFit and SamplerArgs. Make the first-class functions in cmds.py into class methods.

This will clear up packaging dependencies as well.

Current Version:

Beta

add progress bar or similar to `install_cmdstan` and `compile_model` and `sample` functions

Summary:

We need a way to indicate to the user that a script or process is alive and running.

Looking for a solution which is lightweight and robust - i.e., introduces minimal processing overhead and runs on all platforms.

Description:

In the install_cmdstan script, the following operations may take a long time to complete:

downloading the release tar.gz file from GitHub
unpacking it
compiling the CmdStan binaries and the Stan model headers

Likewise, the cmdstanpy functions compile_model and sample should have progress indicators. For the latter, one progress bar per chain would be nice.

Random number generator not configured by seed

Summary:

Seed not passed to random number generator. Inconsistent results were generated when running the code multiple times.

Description:

As in gist file https://gist.github.com/rndpolytope/704fbad70a379e4143d62dcc6b68d592

Additional Information:

Current Version:

cmdstan 2.18.1, cmdstanpy 0.9.0

Travis CI script to build CmdStan is failing

Summary:

Travis CI script to build CmdStan is failing

Description:

The CI script to build CmdStan seems to be failing:

$ ./make_cmdstan.sh
~/build/stan-dev/cmdstanpy/releases ~/build/stan-dev/cmdstanpy
release dir: /home/travis/build/stan-dev/cmdstanpy/releases
latest cmdstan: 
download rc code: 0
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
tar rc 2
./make_cmdstan.sh: line 40: cd: cmdstan: No such file or directory
make: *** No rule to make target 'build'.  Stop.
installed cmdstan-
ls: cannot access 'releases/*': No such file or directory

Additional Information:

Affecting PR #32

Current Version:

add method `optimize`

Summary:

Add methods to run the CmdStan optimize method and to retrieve estimates from the resulting Stan csv files.

Description:

To add this functionality we need to:

add method optimize to Model class
add methods to parse this information out of the Stan csv file

The current set of csv parsing utilities is specific to the csv produced by the NUTS-HMC sampler. Parsing the optimize csv is far more straightforward.

Additional Information:


Current Version:

Extract step-size and mass matrix from sampler csv output file.

Summary:

Extend the CmdStanPy csv file functions to find and extract the step size and mass matrix from the output file.
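A rough sketch of such an extraction is shown below. It assumes the sampler output contains a comment line of the form `# Step size = ...` and a comment line mentioning "inverse mass matrix" followed by comma-separated values on subsequent comment lines. The function name `extract_adaptation` is hypothetical, and the exact comment layout should be checked against real output files.

```python
def extract_adaptation(lines):
    """Return (step_size, metric_rows) parsed from Stan CSV comment lines.

    metric_rows has one row for a diagonal metric and several rows for a
    dense one. Parsing stops at the first line after the metric block.
    """
    step_size = None
    metric = []
    in_metric = False
    for raw in lines:
        line = raw.strip()
        if not line.startswith("#"):
            if in_metric:
                break  # first data row after the metric block
            continue
        text = line.lstrip("#").strip()
        if text.startswith("Step size"):
            step_size = float(text.split("=", 1)[1])
        elif "inverse mass matrix" in text:
            in_metric = True
        elif in_metric:
            try:
                metric.append([float(x) for x in text.split(",")])
            except ValueError:
                break  # a non-numeric comment ends the metric block
    return step_size, metric


example = [
    "lp__,theta",
    "# Adaptation terminated",
    "# Step size = 0.8",
    "# Diagonal elements of inverse mass matrix:",
    "# 0.5, 0.25",
    "-6.8,0.2",
]
step_size, metric = extract_adaptation(example)
```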

Description:


Additional Information:


Current Version:

Add functionality to manage cmdstan git repo

I think it would be great if we had convenience functions for users to specify a specific version of CmdStan and have CmdStanPy manage the cmdstan repo itself.

Configure “algorithm=fixed_param” in new API

Summary:

There seems to be no way to configure algorithm=fixed_param in the new API.
In the old API, this was done with a keyword argument. However, that is not an option with the new Model class.

Current Version:

Newest commit as of Jun 18.

cmdstan install script: change default location, add args for location, version

Summary:

the script make_cmdstan.sh currently installs CmdStan in the module directory, but this will create problems if installed using pip, as pip uninstall won't be able to remove the releases dir. therefore the default location should be in the user's home directory.

as part of this change, add arguments to the script allowing users to specify both the download location and the version of CmdStan to be installed.

Description:

see discussion here: #35 (comment)

Additional Information:

Current Version:

Python3.5 support

Summary:

There are a few uses of Python 3.6's new f-string interpolation (e.g. f'--o={hpp_name}' in model.py). If we replace those with str.format etc. then we can support many more versions of Python.
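For illustration, the f-string quoted above and a `str.format` equivalent produce the same string. The value assigned to `hpp_name` is a made-up example.

```python
hpp_name = "bernoulli.hpp"  # hypothetical value for illustration

# Python 3.6+ only: f-string interpolation.
with_fstring = f"--o={hpp_name}"

# Works on earlier Python 3 versions as well.
with_format = "--o={}".format(hpp_name)
```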

Read Stan csv to StanFit

We need (or do we have this already?) a function that takes a list of Stan csv files and returns a StanFit.

investigate granularity of exceptions required for subprocesses as well as CmdStan return codes - are they useful?

Summary:

Reviewer comments on PR #21 raise several good points -

If return codes from CmdStan are semantic (1 -> unable to initialize, 2 -> max iters w/o convergence, …), they would be best translated to specific exceptions. If they aren't, a RuntimeException would be appropriate, with a separate method for returning stderr lines. Returning an error msg full of retcodes here isn't going to be particularly valuable.

and:

It's sometimes useful to use an API-specific Exception subclass to ensure you don't catch things you shouldn't be catching. Python mostly gets this right for you, e.g. you won't catch a KeyboardInterrupt, but if there are other exceptions that should bubble up to the user, it may be preferable to catch a more specific set of exceptions, e.g. OSError, subprocess.CalledProcessError.

Breaking this out into an issue to be addressed in its own PR.
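The two reviewer suggestions could be combined roughly as follows. This is a sketch: the exception class `CmdStanError` and the helper `run_checked` are hypothetical names, and no meaning is assigned to individual return codes.

```python
import subprocess
import sys


class CmdStanError(RuntimeError):
    """Hypothetical API-specific exception carrying the return code and stderr."""

    def __init__(self, returncode, stderr):
        super().__init__(
            "command failed with return code {}".format(returncode)
        )
        self.returncode = returncode
        self.stderr_lines = stderr.splitlines()


def run_checked(cmd):
    """Run cmd and return its stdout; translate failures into CmdStanError.

    Only subprocess.CalledProcessError is caught. Other exceptions, such as
    OSError for a missing executable, bubble up unchanged.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError as err:
        raise CmdStanError(err.returncode, err.stderr or "") from err
    return proc.stdout


# Stand-in command: a trivial Python process instead of a model executable.
output = run_checked([sys.executable, "-c", "print('ok')"])
```

Callers can then write `except CmdStanError` without accidentally swallowing unrelated errors, and inspect `stderr_lines` when they need detail.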

Description:

Additional Information:


Current Version:

Feature / Run CmdStan in background

Summary:

Start sampling, stop python; ... start python, get progress or read csv files

Description:

It could be helpful to be able to start CmdStan sampling in the background and then later get the csv (or use some other interface).

Additional Information:

Similar interface as

nohup model sample output.csv  > sample.log &

function `sample`, argument `metric` should allow Dict

Summary:

The user should be allowed to specify the metric argument as a Dict with entry inv_metric which is a numpy.ndarray consisting of a vector or square matrix.

Description:

The metric argument is used to initialize the metric used by the sampler during adaptation. It is either a vector consisting of the diagonal entries of the covariance matrix or the full covariance matrix, corresponding to metric types diag_e and dense_e.

CmdStan operates on files, so when the input is a Dict, it must be written to a temp file in JSON format - see how this is done for data and inits args.
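A sketch of the temp-file step, under the assumption that array-like values expose a `tolist()` method (as numpy arrays do). The function name `write_dict_to_tempfile` and the key name `inv_metric` used in the example are assumptions, not the existing CmdStanPy code.

```python
import json
import tempfile


def write_dict_to_tempfile(values):
    """Write a Dict to a temporary JSON file and return the file path.

    Values with a tolist() method (e.g. numpy arrays) are converted to
    plain lists first, since the json module cannot serialize them directly.
    """
    serializable = {
        key: (value.tolist() if hasattr(value, "tolist") else value)
        for key, value in values.items()
    }
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fd:
        json.dump(serializable, fd)
        return fd.name


# A diagonal metric given as a plain list; a numpy vector would work the same way.
metric_path = write_dict_to_tempfile({"inv_metric": [0.5, 0.25]})
```

The file is created with `delete=False` so that it still exists when CmdStan reads it; the caller is responsible for cleaning it up.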

Additional Information:

Update sample cmd allowed argument types, and make corresponding changes to the SamplerArgs object and the unit tests.

Users will most likely get the ndarray by running the sample command and then accessing the resulting StanFit object's metric property. Although this property is a list of per-chain metrics, when specifying an initial metric the same metric is used for all chains; therefore it is a single entry, not a list.

Current Version:

refactor - standalone function `sample` instead of member function of Model

Summary:

Refactor Model class so that member function sample is a standalone method with named arguments for all controls on the HMC-NUTS sampler.

Description:

Currently the sample command is a member function of the Model class. Refactor as a standalone function which takes a model object as first argument.

Additional Information:

See the functional spec https://github.com/stan-dev/design-docs/blob/master/designs/0002-cmdstanpy_func_spec.md for full details