bambi's People

Contributors

agustinaarroyuelo, aloctavodia, amelio-vazquez-reina, camenpihor, canyon289, ejolly, gstechschulte, gweindel, hectormz, jake-westfall, jt-lab, juanitorduz, julianlheureux, kddubey, krassowski, markgoodhead, mbjoseph, michaelosthege, nathanielf, oriolabril, star1327p, suryamudimi, taylorterry3, thatch, tjburch, tomicapretto, twiecki, tyarkoni, wd60622, yannmclatchie

bambi's Issues

Contrasts

We need to add support for contrasts, probably in an add_contrast() method that parallels add_term().
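
A minimal sketch of what such an interface might look like (add_contrast() and its arguments are hypothetical at this point):

# Hypothetical usage sketch; add_contrast() does not exist yet.
model = Model(data)
model.add_y('rt')
model.add_term('condition')
# e.g., sum-to-zero coding, or a user-supplied contrast matrix
model.add_contrast('condition', type='sum')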

When adding random slopes for categorical X, drop_first if random intercept found

Users will expect a random specification like random=['1|subj','X|subj'] to work, because in most other software the random X slopes are added as K-1 dummies (K being the number of levels of X) rather than the full set of K dummies, one for each cell mean. Our current default for X|subj is to add K random cell means for each subject, but if the user has also specified a random intercept, then this is an unidentifiable model, because it attempts to estimate K+1 means for each subject, but each subject only has K means.

A reasonable solution is the following: when encountering a term like X|subj, check to see if the user also specified a random intercept for subj, i.e., 1|subj. If yes, then use the drop_first parameterization for X|subj, so that we add K-1 random dummies for each subject. If no, then use the cell means parameterization where we add K random cell means for each subject. This way we always add K parameters to the model.
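
A rough sketch of that rule (the helper below is illustrative, not the actual implementation):

# Illustrative sketch of the proposed rule for a term like 'X|subj'.
def add_categorical_random_slopes(model, term):
    predictor, factor = term.split('|')
    # assuming the random intercept term is registered under a key like '1|subj'
    if '1|%s' % factor in model.terms:
        # random intercept already present: add K-1 random dummies per subject
        model.add_term(predictor, split_by=factor, random=True, drop_first=True)
    else:
        # no random intercept: add K random cell means per subject
        model.add_term(predictor, split_by=factor, random=True, drop_first=False)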

Calling `add_formula` after `add_y` erases the y variable

For example, this works fine:

model.add_formula('condition + gender', random='condition|subject')
model.add_y('rt')

But reversing the order of the terms does not work:

model.add_y('rt')
model.add_formula('condition + gender', random='condition|subject')

After calling add_formula, the y attribute of the model disappears, so attempting to build the model fails with an error.

Results compared to Stan

Hi there,
Great work! I am wondering if you have compared the model estimation result with the output from lme4?

A while ago I tried to compare different approaches to fitting a simple linear mixed model (a random-intercept model with subject as the random effect). While the results using Stan (via the brms package in R) are almost identical to the output from lme4, I am surprised that the estimates from PyMC3 are always a little bit off.

You can have a look at my attempt here.

Implicitly add random intercept when continuous random slopes added to model

Currently adding a random term like X1|subject, where X1 is a continuous predictor, will add random X1 slopes across subjects, but will not add random subject intercepts. But users will likely expect the intercept to be added. Also, intercepts are implicitly added in the fixed part of the model, so maybe we want to be consistent.

One issue to watch out for when adding this functionality is to make sure that random intercepts for a factor are never added twice. For example, if a previous term X1|subject implicitly added random subject intercepts, then a subsequent term X2|subject should NOT try to add random subject intercepts a second time.
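
A minimal sketch of the bookkeeping (the helper names are hypothetical):

# Hypothetical sketch: add the random intercept only the first time a factor appears.
def add_continuous_random_slopes(model, predictor, factor):
    if not model.has_random_intercept(factor):   # hypothetical helper
        model.add_term(factor, random=True)      # implicit random intercepts
    model.add_term(predictor, split_by=factor, random=True)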

Name change

This package needs a new name that doesn't imply any commitment to either (a) psychological data or (b) PyMC. Down the road, we may want to add other back-ends (e.g., Stan), so we should be maximally agnostic.

Improved ModelResults.summary()

Currently ModelResults.summary() just wraps PyMC3's summary() method. It would be good to return basic information about the model--number of terms, number of columns, description of priors used, number of samples run, collinearity diagnostics from the term _setup() calls, etc.

We may also want to suppress (or truncate) the reporting of all random effect parameter estimates by default, since they can spew a lot of output at the user.

Add support for model comparison

PyMC3 has support for several model comparison methods (DIC, leave-one-out, etc.). We should wrap the existing functionality to allow comparison of Model objects.
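
PyMC3 already exposes waic, loo, and dic (they appear in the dir(pymc3) listing quoted in a later issue), so a thin wrapper could look roughly like this (the compare() signature and the .trace/.model attributes on results are assumptions):

import pymc3 as pm

# Sketch of a wrapper over existing PyMC3 functionality; not a final API.
def compare(results_a, results_b, method='waic'):
    ic = pm.waic if method == 'waic' else pm.loo
    return (ic(results_a.trace, results_a.model),
            ic(results_b.trace, results_b.model))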

Plots!

Yes, those things. We need some of them.

Add convenient ways of specifying default priors on a model-wide level

Users can specify what prior to use for each term, but it would be good to force some explicit choice of a default prior during model construction. One way to do this would be to prevent build() from executing unless defaults have been explicitly set. Something like this would work:

model = Model(data)
# could be 'narrow', 'medium', 'uniform', etc.,
# or alternatively an actual distribution specification to use everywhere.
model.set_defaults('wide')
model.add_term(...)   # uses the default set above

The key point is to force a user to take minimal responsibility for the priors somewhere in the model specification process (though the specification doesn't have to be made explicit in terms of distributions and parameters).

Implement a backend-independent representation of results

Following the Model/Backend separation, we should create a ModelResult class that stores a backend-independent representation of the sampler results. For now this can be more or less a container for PyMC3 traces, as there aren't any other backends.
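
A minimal sketch of such a container (attribute and method names are placeholders):

import pymc3 as pm

# Minimal sketch of a backend-independent results container.
class ModelResults(object):
    def __init__(self, model, trace):
        self.model = model    # the bambi Model that was fitted
        self.trace = trace    # for now, just the PyMC3 MultiTrace

    def summary(self, burn_in=0):
        # delegate to PyMC3 for the time being
        return pm.summary(self.trace[burn_in:])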

Call to missing pymc3 attribute (dot)

Thank you for the wonderful package. I'm running into the following issue though: model.fit() complains that there is a missing attribute in pymc3 (dot):

data = pd.DataFrame({ 'one' : pd.Series([5, 4, 3, 1]), 'two' : pd.Series([1, 2, 3, 4]) })
model = bambi.Model(data)
modelFitted = model.fit('one ~ two', samples=2000, njobs=2)
modelFitted.plot(burn_in=50)

AttributeError Traceback (most recent call last)
in ()
1 model = bambi.Model(df)
----> 2 modelFitted = model.fit('st ~ congr:fam:cued', samples=2000, njobs=2)
3 modelFitted.plot(burn_in=50)

/Library/Python/2.7/site-packages/bambi/models.pyc in fit(self, fixed, random, priors, family, link, run, categorical, **kwargs)
218 warnings.warn("Current Bayesian model has not been built yet; "
219 "building it first before sampling begins.")
--> 220 self.build()
221 return self.backend.run(**kwargs)
222

/Library/Python/2.7/site-packages/bambi/models.pyc in build(self)
172 scaler.scale(t)
173
--> 174 self.backend.build(self)
175 self.built = True
176

/Library/Python/2.7/site-packages/bambi/backends.pyc in build(self, spec, reset)
109 coef = self._build_dist(prefix + label, dist_name,
110 shape=n_cols, **dist_args)
--> 111 self.mu += pm.dot(data, coef)[:, None]
112
113 y = spec.y.data

AttributeError: 'module' object has no attribute 'dot'

All the dependencies for pymc3 seem to be satisfied (I went through pymc3's requirements.txt to make sure), yet:

print dir(pymc3)

['AR1', 'ArrayOrdering', 'Bernoulli', 'Beta', 'BetaBinomial', 'BinaryGibbsMetropolis', 'BinaryMetropolis', 'Binomial', 'Bound', 'CallableTensor', 'Categorical', 'Cauchy', 'CauchyProposal', 'ChiSquared', 'CompoundStep', 'Constant', 'ConstantDist', 'Continuous', 'DensityDist', 'Deterministic', 'DictToArrayBijection', 'DictToVarBijection', 'Dirichlet', 'Discrete', 'DiscreteUniform', 'Distribution', 'ElemwiseCategorical', 'ExGaussian', 'Exponential', 'Factor', 'Flat', 'GARCH11', 'Gamma', 'GaussianRandomWalk', 'Geometric', 'HalfCauchy', 'HalfNormal', 'HalfStudentT', 'HamiltonianMC', 'InverseGamma', 'LKJCorr', 'Laplace', 'LaplaceProposal', 'Lognormal', 'Metropolis', 'Model', 'Multinomial', 'MultivariateNormalProposal', 'MvNormal', 'MvStudentT', 'NUTS', 'NegativeBinomial', 'NoDistribution', 'Normal', 'NormalProposal', 'Pareto', 'Point', 'Poisson', 'PoissonProposal', 'Potential', 'SkewNormal', 'Slice', 'StudentT', 'StudentTpos', 'TensorType', 'Uniform', 'VonMises', 'Wald', 'Weibull', 'Wishart', 'WishartBartlett', 'ZeroInflatedNegativeBinomial', 'ZeroInflatedPoisson', 'builtins', 'doc', 'file', 'name', 'package', 'path', 'version', '_log', 'advi', 'advi_minibatch', 'approx_hessian', 'arraystep', 'autocorr', 'autocorrplot', 'autocov', 'backends', 'blocking', 'bool_types', 'bpic', 'compilef', 'complex_types', 'compound', 'cont_inputs', 'continuous_types', 'data', 'debug', 'default_type', 'df_summary', 'diagnostics', 'dic', 'discrete_types', 'distributions', 'effective_n', 'eval_univariate', 'fastfn', 'find_MAP', 'find_hessian', 'float_types', 'fn', 'forestplot', 'gelman_rubin', 'get_data_file', 'geweke', 'gibbs', 'glm', 'gradient', 'guess_scaling', 'handler', 'hessian', 'hessian_diag', 'hmc', 'hpd', 'inputvars', 'int_types', 'invlogit', 'iter_sample', 'jacobian', 'join_nonshared_inputs', 'kde2plot', 'kdeplot', 'logging', 'logit', 'logsumexp', 'loo', 'make_shared_replacements', 'math', 'mc_error', 'memoize', 'metropolis', 'model', 'modelcontext', 'np', 'nuts', 'plot_posterior', 'plots', 'quadpotential', 'quantiles', 'sample', 'sample_ppc', 'sample_vp', 'sampling', 'scaling', 'slicer', 'starting', 'stats', 'step_methods', 'summary', 'test', 'tests', 'theanof', 'trace_cov', 'trace_to_dataframe', 'traceplot', 'tuning', 'typefilter', 'variational', 'vartypes', 'waic']

Thanks in advance for your help. My platform is Mac OS 10.10.5, Python 2.7.
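
For what it's worth, newer PyMC3 releases keep the tensor helpers in a math submodule (note 'math' in the dir() output above), so the failing backend line would presumably become something like the following; this is a guess at the fix, not a confirmed patch:

# backends.py, around line 111 -- pm.dot is no longer a top-level attribute
self.mu += pm.math.dot(data, coef)[:, None]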

Rename level names when returning categorical variables

Patsy will rename categorical variables sequentially, but it's helpful to return results to the user with the original level values for interpretability's sake. By default, we would ideally back-project extracted column names to the original level names in the data passed in. This may or may not be easily feasible given the current implementation, but we should take a look.

Throw error when symbols like + given in random terms

The lme4 package in R supports random terms like (X1+X2|subject), which it interprets as (X1|subject) + (X2|subject), so users will likely assume the same behavior holds in bambi. But it doesn't. It's not clear whether we WANT to support such syntax, so the easiest solution is just to search random-term strings for unexpected symbols like '+' and throw an error if any are found.
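
A short sketch of that validation (the regex and error message are illustrative):

import re

# Illustrative sketch: reject random-term strings containing operators we don't parse.
def validate_random_term(term):
    if re.search(r'[-+*/:]', term):
        raise ValueError("Unsupported operator in random term '%s'; "
                         "write separate terms like 'X1|subj' instead." % term)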

Deal with NA values

Currently, it seems that if you have NA / NaN (dependent variable) cells, you get an error because the LHS and RHS aren't the same length. Would it be possible to do the NA filter before splitting into LHS and RHS?
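
A sketch of the proposed ordering (the column list here is illustrative):

# Sketch: drop rows with missing values in any modeled column *before*
# building the design matrices, so the LHS and RHS stay aligned.
cols = ['rt', 'condition', 'gender', 'subject']   # columns used in the model
clean = data.dropna(subset=cols)
model = Model(clean)
model.fit('rt ~ condition + gender', random='condition|subject')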

'backend' as a parameter of Model.build, not Model.__init__

Basically right now if we want to compare results between pymc3 and stan for the same model and data, we do something like:

pymc3_model = Model(data, backend='pymc3')
pymc3_fitted = pymc3_model.fit('y ~ x')
stan_model = Model(data, backend='stan')
stan_fitted = stan_model.fit('y ~ x')

I think something like this would be neater:

model = Model(data)
pymc3_fitted = model.fit('y ~ x', backend='pymc3')
stan_fitted = model.fit(backend='stan')

Where the argument to 'backend' here gets passed through fit() to build(). I think this would make more sense because then the model specification is totally abstracted away from the sampling method. (Well, in the second example I specified the model and the pymc3 backend in the same call, so the abstraction is not as obvious, but to be even cleaner we could first specify the model using add_formula() and then make two separate calls to fit().)

Add ability to update priors after model specification

At the moment it's cumbersome to specify priors on individual terms (at least via the one-shot fitting interface), and in some cases the user might not even know the name of the term constructed by patsy in advance. We should add a set_prior method to Model that allows the user to set priors on existing terms by name. It could also have arguments like fixed, random, etc., allowing replacement of priors for the whole class of fixed or random terms.
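
A sketch of how the proposed method might be used (set_prior() and the Prior spelling shown here are part of the proposal, not the current API):

# Proposed usage sketch; set_prior() does not exist yet.
model = Model(data)
model.add_formula('condition + gender', random='condition|subject')
model.add_y('rt')
model.set_prior('condition', Prior('Normal', mu=0, sd=1))   # by term name
model.set_prior(fixed=True, prior='narrow')                 # or for all fixed terms at once
model.build()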

Allow only random effects

While unlikely, a user could conceivably want to include only random effects in a model. Currently the fixed argument is mandatory in add_formula; it should be made optional.

Add R-like t-test interface as a special case

To facilitate adoption and familiarize naive users with the package, we should add a package-level ttest() method that just wraps add_term() or fit() calls, and has the same arguments as R's t.test function.

Removable discontinuity in slope = f(corr)

Our default priors are based in part on derivatives of a function that maps the partial correlation to the implied slope. But this function has a removable discontinuity at 0 (due to the sign function), so the derivative does not exist at exactly the point where we almost always want to evaluate it. Right now we get around this by evaluating the derivative at a point very close to 0. This mostly works well, although we have to evaluate it a bit further from 0 for increasingly higher-order approximations or the results get weird. I worked out what seem to be good distances from 0 for different levels of approximation by trial and error, but this would all be a lot nicer if we could deal with the removable discontinuity directly and then just take the derivative at 0. Some initial attempts to do this using limit() in Sympy didn't seem to work, but we should return to this issue eventually and see if we can make it work.

Specify burn-in at ModelResults class level

Currently, burn_in is strictly a method-level parameter (in .plot and .summary), but for convenience and consistency, a user should be able to set it on the ModelResults instance—preferably propagated from the Model.run() call. Then the burn_in argument can default to None instead of 0, and check for a value in the instance before falling back on 0.
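
The intended precedence is roughly method argument > instance value > 0, as in this sketch:

# Sketch of the fallback logic (instance_burn_in would come from Model.run()).
def effective_burn_in(arg_burn_in, instance_burn_in):
    if arg_burn_in is not None:        # explicit method-level argument wins
        return arg_burn_in
    return instance_burn_in if instance_burn_in is not None else 0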

No easy way to override default prior args for Y?

It looks like there's no easy mechanism for overriding the default prior arguments for the outcome Y. One can easily change the family through the family parameter of add_y(), but if one wanted to use, say, a Normal family with the prior SD set to something other than the default, right now the only mechanism for doing so is to pass an argument to default_priors while initializing the model. It would be easier and more intuitive to do this in the call to add_y(), so maybe we want to give add_y() an optional prior parameter that overrides the default.

Allow priors on fixed effects to be given on standardized scale

Two candidates for alternative ways to set priors on the fixed regression coefficients:

  • Normal or Cauchy priors on the standardized regression coefficients
  • Beta priors (scaled/shifted to be in [-1,1] range) on the partial correlation coefficients

Priors on the partial correlations would be a little nicer, but both would be viable solutions to the scaling problem (i.e., the same prior on a raw regression coefficient would be more or less informative depending on the scales of X and Y). Setting priors on the partial correlations could perhaps even be the default, unless the user elects to put priors directly on the regression coefficients.
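
For reference, moving between the two scales is just a rescaling: a Normal prior with SD s on the standardized slope for X implies a Normal prior with SD s * sd(Y) / sd(X) on the raw slope, as in this sketch:

import numpy as np

# Sketch: convert a prior SD on the standardized scale to the raw scale.
def raw_scale_sd(sd_standardized, y, x):
    return sd_standardized * np.std(y) / np.std(x)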

In add_term interface, adding random slopes overwrites corresponding fixed slope

Here is a random slopes model specified both via the formula interface and the add_term interface:

# build model using formula
modelA = Model(crossed_data)
modelA.fit('Y ~ continuous', random=['continuous|subj'], run=False)
modelA.build()

# build model using add_term
modelB = Model(crossed_data)
modelB.add_y('Y')
modelB.add_intercept()
modelB.add_term('continuous', random=False)
modelB.add_term('subj', random=True)
modelB.add_term('continuous', split_by='subj', random=True)
modelB.build()

The model specified via the formula interface ends up with the expected terms:

modelA.terms
# OrderedDict([('Intercept', <bambi.models.Term at 0x196889c50>),
#              ('continuous', <bambi.models.Term at 0x1968864a8>),
#              ('subj', <bambi.models.RandomTerm at 0x19683a908>),
#              ('continuous|subj', <bambi.models.RandomTerm at 0x19686f438>)])

But the model specified via add_term, which I believe should be the equivalent model, has lost the fixed slope:

modelB.terms
# OrderedDict([('Intercept', <bambi.models.Term at 0x19686f0b8>),
#              ('continuous', <bambi.models.RandomTerm at 0x196950358>),
#              ('subj', <bambi.models.RandomTerm at 0x196950ef0>)])

The random term labels don't match, as you've noted elsewhere, but that's not really a big deal because the terms themselves are the same. Anyway, this problem is likely due to the fact that the label for the random slopes coincides with the label for the corresponding fixed slope.

For categorical outcomes, tell user which event is being modeled

When the outcome is categorical, such as in logistic regression, and the outcome consists of, say, strings such as 'yes' and 'no', bambi makes some decision about which event to model (i.e., whether the estimated probabilities are P(y='yes') or P(y='no')), but this decision is silent. We should print a standard message with the model output (or even during fitting?) that lets users know which event is being modeled. If the outcome is an array of 0s and 1s, then one can probably assume that it is modeling P(y=1) (and in fact I've verified that this is true in at least one case), but even here it would be nice to be assured.

Add tests for the full range of models we support

Adding tests for different models should be a high priority. I don't think we need to check for numerical correctness of output just yet (though we'll get there), but we do want to make sure the number of parameters we get back for each term is right, etc. We should add at least one test for every major permutation of model we can support (e.g., fixed effects alone; fixed effects with one random intercept term; crossed random effects; etc.).

I think these can probably be specified with the formula syntax for now, though a better approach would ultimately be to specify the same model using both the high- and low-level APIs and then check for equivalency of results (in addition to checking that the trace looks right).

Plot or summarize priors

Since the priors are now adapted to the data, we should have some way of showing the user what the parameters actually were. Either a plot_priors() method comparable to the kdeplots produced by traceplot() or some kind of summarize_priors() that produces a table of distribution names and parameters would do the trick.

Test model estimates against ground truth

A pretty big hole in our current testing approach is that right now all of our tests are just testing that models compile, run, and return the right variables. We should probably pick out a limited subset of models and test that the model estimates are close to the ground truth (where the "ground truth" comes from either analytical solutions or other packages that converge with one another). This will also be helpful as we add new backends and/or experiment with different ways of parameterizing the same models (cf. #62).

Replacement for pymc3.glm?

This looks really cool.

It seems like this by far supersedes the pymc3.glm functionality. That maybe should have been a separate package from the beginning. Do you think that's accurate? In that case, I wonder if I should get rid of pymc3.glm and instead refer to this package. Any thoughts?

Extend support from LMMs to GLMMs

Specifically, add support for the following distributions of Y (beyond Gaussian):

  • Binomial
  • Poisson

And for the following link functions (beyond the identity function):

  • Logit
  • Probit
  • Log

Add basic support for formula-style specification

In principle, it would be easy to support formula-style model specification via either PyMC3's existing glm interface, or by way of patsy, which exposes a DesignMatrix class (alongside the returned numpy array) that tracks all of the attributes we need in order to construct Term instances.

The main drawback of the former approach is that PyMC3's glm module has no concept of terms that could span multiple columns: every column in the DataFrame produced by patsy has to be assigned its prior individually by name (unless the default can be applied everywhere), which is untenable for our purposes.

The problem with the latter approach is that patsy doesn't support specification of random effects, which are pretty critical here. I'm not sure what to do about this. We could (a) write our own parsing engine (this does not seem like a good idea to me); (b) support only the operators patsy supports, and force users to add random terms using the more explicit add_term() syntax; or (c) try some hacky thing where we initially find-and-replace on the | operator, then rely on patsy to do the heavy lifting, and then finally pass a random=True argument to the Term initializer. I lean towards (c), though I suspect it will get messy quickly once we start thinking about nested or interacting factors. Thoughts, @jake-westfall?
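
A rough sketch of option (c), just to make the idea concrete (the splitting is deliberately naive and ignores parentheses):

import patsy

# Illustrative sketch: peel off '|' terms, let patsy handle the rest.
def parse_formula(fixed_formula, data):
    parts = [p.strip() for p in fixed_formula.split('+')]    # naive split
    random_parts = [p for p in parts if '|' in p]             # e.g. 'X|subj'
    fixed_part = ' + '.join(p for p in parts if '|' not in p)
    fixed_dm = patsy.dmatrix(fixed_part, data, return_type='dataframe')
    return fixed_dm, random_parts   # random parts become Terms with random=True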

family = 'binomial' error

Hi there,
I am getting a weird error after upgrading to 0.0.4.
I am modelling the accuracy of the behavioural response ([0, 1]) using:

acc_fitted = ACCmodel.fit('ACC ~ ViewCond*Race*Orit',random='1|Sbj',
                          family='binomial', samples=2000, njobs=2)

Which was fine before the upgrade but now I am getting the following error:

Traceback (most recent call last):

  File "<ipython-input-34-c6e18d364f43>", line 3, in <module>
    family='binomial', samples=2000, njobs=2)

  File "/usr/local/lib/python3.4/dist-packages/bambi/models.py", line 254, in fit
    self.build()

  File "/usr/local/lib/python3.4/dist-packages/bambi/models.py", line 206, in build
    self.y.name, str(self.data[self.y.name][event])))

  File "/usr/local/lib/python3.4/dist-packages/pandas/core/series.py", line 601, in __getitem__
    result = self.index.get_value(self, key)

  File "/usr/local/lib/python3.4/dist-packages/pandas/indexes/base.py", line 2139, in get_value
    tz=getattr(series.dtype, 'tz', None))

  File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas/index.c:3338)

  File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas/index.c:3041)

  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)

  File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8141)

  File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8085)

KeyError: 1

Any suggestion?

Intercept default prior has SD=0 when all predictors are centered

This is not a bug exactly, but an unanticipated consequence of how we determine the default prior for the intercept (the logic is explained in this notebook).

It actually sort of makes sense when you think about it: in an OLS context, if all the predictors are centered, then we know that the estimate for the intercept will be the sample mean of Y. Consistent with this, our defaults assign a prior with mean = mean(Y) and sd = 0 when all the predictors are centered. But this is not really desirable in practice because, among other things, it causes computational problems when trying to run the model.

One way of viewing this is that the problem essentially arises because we treat mean(Y) as a fixed, known quantity, when in fact Y is a function of the random data, so there ought to be uncertainty about mean(Y) even if all the predictors are centered. We know that the OLS estimate of the intercept will equal the sample mean of Y, but this obviously does not imply that we are certain about the population mean of Y, which is what our default method kind of assumes. So a simple fix would be to add var(mean(Y)) or var(Y) to the variance of the intercept prior -- the former option is more technically correct, but could lead to a really narrow prior when there's a lot of data, which could conceivably have other negative consequences, so the latter option could possibly work better in practice, despite being perhaps less principled.
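
In code, the proposed adjustment amounts to something like this sketch (y and base_sd are placeholders; base_sd stands for whatever the current default logic produces, which is 0 when all predictors are centered):

import numpy as np

# Sketch of the proposed fix: widen the intercept prior by the uncertainty in mean(Y).
extra_var = np.var(y) / len(y)      # var(mean(Y)) -- the more principled option
# extra_var = np.var(y)             # or simply var(Y), likely safer in practice
intercept_prior_sd = np.sqrt(base_sd ** 2 + extra_var)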

Drop statsmodels dependency?

It turns out that pd.stats.api.ols internally calls statsmodels, so we haven't actually eliminated the statsmodels dependency. We can either switch to np.linalg.lstsq, or just explicitly call statsmodels. The pandas stats API is actually deprecated, so we'll need to change either way. I'd prefer to avoid requiring statsmodels, since we're only using it for least-squares fitting, but if you prefer its interface to numpy's for the stuff in Model.build(), @jake-westfall, that's fine.
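
The numpy replacement is essentially a one-liner (X is the fixed-effects design matrix, including an intercept column, and y the outcome):

import numpy as np

# Ordinary least squares without statsmodels or the deprecated pandas stats API.
coefs, residuals, rank, singular_values = np.linalg.lstsq(X, y)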

Modified default priors for non-normal models

The default priors make good sense for normal models, but are not designed to work out-of-the-box with non-normal models. There are two complications that I can see: defining SD(Y) and defining dm_statistics['r2_y'], which is the R^2 from regressing Y on all the other Xs.

For SD(Y), the value we use should be something that makes sense on the link scale, not necessarily on the observed scale. Commit 98a4a45 introduces an interim solution that handles the binomial/logit and binomial/probit cases (for when the latter is implemented). In those models the variability of Y on the link scale is not identifiable over and above the slopes, so I fix the constant to a value based on setting the variance of the latent logistic or normal variable to 1 (see section 3.5 here for a similar approach). All other family/link combinations are currently just left as-is, i.e., using the observed SD(Y), so that all the models will still run. And this default should be fine when family=normal or link=identity, but for other cases we should think about what sensible defaults are.

For r2_y, this currently uses the classical R^2, but this only really makes sense when Y is normal. For example, if Y is binomial, then really this should be based on a logistic regression of Y on all the other Xs. One thing that could work is to just replace this with a pseudo-R^2 statistic in the non-normal case. But as of commit 98a4a45 we are still just using the classical R^2.

I think the general principle we want to shoot for is that the resulting interpretation of the prior SD would always be in terms of partial correlations between the predictors and the latent outcome in the case of GLM(M)s. I think for binomial/logit models this is accomplished by the current SD(Y) fix + replacing r2_y with a pseudo-R^2. But we should think about how to accomplish this for other family/link combinations.

Add informative message when prior scaling fails due to non-identifiable fixed effects

When a model with default priors and non-identifiable fixed effects is created (e.g., by calling something like add_term('condition', drop_first=False) when an intercept is already in the model), the prior scaling approach currently implemented in PriorScaler sets the random factor hyperprior variances to inf, and the sampler consequently refuses to start. We should either check for this scenario and substitute some large but sane value, or raise an exception with a more informative message rather than waiting for Theano to complain.

Incremental sampling

We should add the ability to sample incrementally (i.e., to restart the sampler with the last sample as the new starting point, and append new samples to the existing MultiTrace).
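
A sketch of the intended usage (the continue_from argument is hypothetical; the point is just that new draws get appended to the existing trace rather than starting from scratch):

# Hypothetical usage sketch; the incremental API does not exist yet.
results = model.fit('y ~ x', samples=1000)
# ... inspect the trace, decide more samples are needed ...
results = model.fit(samples=1000, continue_from=results)   # restarts at the last sample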

Add ability to add interactions

Currently there's no way to add interaction terms to models; users have to take the products of columns by hand (which is much too cumbersome in the case of categoricals). We should extend the API to handle this via, e.g., an add_interaction() method that takes a list of names of existing Terms as its main argument. It should also, by default, create terms for main effects if they don't already exist (though this will only make sense if the user wants the same priors applied to all generated terms).
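
A sketch of the proposed call (the signature is hypothetical):

# Hypothetical usage sketch for the proposed method.
model.add_term('condition')
model.add_term('gender')
model.add_interaction(['condition', 'gender'])   # adds condition:gender; would also
                                                 # create missing main effects by default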

For categorical X, internally change random terms from `X|subj` to `subj|X`

For adding random slopes for categorical predictors, the usual syntax (e.g., in lme4) is X|subj, where X is the predictor and subj is the random factor. Right now this does not give the expected behavior, as it does not add separate random subject terms for each level of X. A simple fix is to internally change this to subj|X. The key is to make this change internal, because in the usual syntax that users are familiar with, subj|X doesn't really make sense. So we should let users write X|subj but internally interpret it as subj|X.

Docstrings

The code is pretty poorly documented at the moment. We need to make sure every public method has a detailed docstring. Once that's done, we can think about adding Sphinx support.

Summary output for two-level categorical variables

The model summary / coefficients for two-level categorical variables (whether a str or Categorical in pandas) display the term name but not the level / contrast generated by patsy. This makes interpreting the output somewhat more difficult.

Inspecting the model structure a bit, I see that model.terms['binary_var'].categorical == False. Binary variables do have an obvious treatment coding, but I always have trouble determining/remembering the reference level in patsy. (According to the documentation it's the 'first' level, but does that mean the first level to appear in the data frame? That can change. In R, 'first' is determined based on lexicographic ordering.)

I haven't yet tinkered with a many-leveled variable to see what happens there.

Explore alternative parameterization of random effects

Inspired by an interesting and (I think) important blog post by @twiecki. We have definitely seen this sampler behavior in several of our models.

A big implication could be that our current (traditional) parameterization of random effects may tend to overestimate the variances of random effects whose true variances are near 0. This would be particularly important for random stimulus effects, which in very many cases are less variable than random subject effects (in fact, random stimulus slopes very often have near-0 variance). This is further exacerbated by the unfortunate fact that much of psychology/neuroscience tends to use quite small stimulus samples, so that even relatively small overestimates in the stimulus variance can result in a non-trivial decrease in the precision of posteriors for the effects of interest.

It is also possible that this alternative parameterization could result in faster sampling, since presumably fewer proposals would be rejected while the sampler is exploring the low-random-effect-variance part of the parameter space... I think.

Anyway, we should do some experimentation and systematic comparison between parameterizations.
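
For reference, the non-centered ("offset") version of, say, a random subject intercept looks like this in PyMC3 (a minimal sketch; n_subjects and the priors are illustrative):

import pymc3 as pm

# Sketch: non-centered parameterization of random subject intercepts.
with pm.Model():
    sd_subj = pm.HalfCauchy('sd_subj', beta=1)
    offsets = pm.Normal('subj_offsets', mu=0, sd=1, shape=n_subjects)
    subj_intercepts = pm.Deterministic('subj_intercepts', offsets * sd_subj)
    # subj_intercepts[subject_idx] then enters the linear predictor as usual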

ESCS_multiple_regression fails

The stack trace is as follows:

>>> fitted = model.fit('drugs ~ o + c + e + a + n', samples=2000, njobs=2)
$HOME/Code/bambi/bambi/models.py:227: UserWarning: Current Bayesian model has not been built yet; building it first before sampling begins.
  warnings.warn("Current Bayesian model has not been built yet; "


TypeError                                 Traceback (most recent call last)
<ipython-input-5-02e82ee54338> in <module>()
      1 model = bmb.Model(data)
----> 2 fitted = model.fit('drugs ~ o + c + e + a + n', samples=2000, njobs=2)

$HOME/Code/bambi/bambi/models.pyc in fit(self, fixed, random, priors, family, link, run, categorical, **kwargs)
    227                 warnings.warn("Current Bayesian model has not been built yet; "
    228                               "building it first before sampling begins.")
--> 229                 self.build()
    230             return self.backend.run(**kwargs)
    231 

$HOME/Code/bambi/bambi/models.pyc in build(self)
    181                 self.y.name, str(self.data[self.y.name][event])))
    182 
--> 183         self.backend.build(self)
    184         self.built = True
    185 

$HOME/Code/bambi/bambi/backends.pyc in build(self, spec, reset)
    119             y_prior.args['observed'] = y
    120             y_like = self._build_dist(
--> 121                 spec.y.name, y_prior.name, **y_prior.args)
    122 
    123             self.spec = spec

$HOME/Code/bambi/bambi/backends.pyc in _build_dist(self, label, dist, **kwargs)
     70             return v
     71 
---> 72         kwargs = {k: _expand_args(k, v, label) for (k, v) in kwargs.items()}
     73         return dist(label, **kwargs)
     74 

$HOME/Code/bambi/bambi/backends.pyc in <dictcomp>((k, v))
     70             return v
     71 
---> 72         kwargs = {k: _expand_args(k, v, label) for (k, v) in kwargs.items()}
     73         return dist(label, **kwargs)
     74 

$HOME/Code/bambi/bambi/backends.pyc in _expand_args(k, v, label)
     67             if isinstance(v, Prior):
     68                 label = '%s_%s' % (label, k)
---> 69                 return self._build_dist(label, v.name, **v.args)
     70             return v
     71 

$HOME/Code/bambi/bambi/backends.pyc in _build_dist(self, label, dist, **kwargs)
     71 
     72         kwargs = {k: _expand_args(k, v, label) for (k, v) in kwargs.items()}
---> 73         return dist(label, **kwargs)
     74 
     75     def build(self, spec, reset=True):

/usr/local/lib/python2.7/site-packages/pymc3/distributions/distribution.pyc in __new__(cls, name, *args, **kwargs)
     27             return object.__new__(cls)  # for pickle
     28         else:
---> 29             raise TypeError("needed name or None but got: " + name)
     30 
     31     def __getnewargs__(self):

I got the same error for a linear mixed model. The logistic regression example still works.

Better specification of random effects

Currently random effects must be specified by passing a list that only accepts the pipe operator, and none of the other standard formula operators (e.g., 'subject|group' is fine; '(1+subject)|group' breaks). In the long term we should eventually handle all the operators accepted by lme4, but a shorter-term solution is to check that no invalid operators are passed, and raise an error message if any are detected (e.g., '1+subject|group' will raise an exception). This ensures that people don't run into unexpected behavior where the model runs without complaint, but fits the wrong model.
