bambi's People

Contributors

agustinaarroyuelo, aloctavodia, amelio-vazquez-reina, camenpihor, canyon289, ejolly, gstechschulte, gweindel, hectormz, jake-westfall, jt-lab, juanitorduz, julianlheureux, kddubey, krassowski, markgoodhead, mbjoseph, michaelosthege, nathanielf, oriolabril, star1327p, suryamudimi, taylorterry3, thatch, tjburch, tomicapretto, twiecki, tyarkoni, wd60622, yannmclatchie

bambi's Issues

Contrasts

We need to add support for contrasts, probably in an add_contrast() method that parallels add_term().
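
A minimal sketch of what such an interface might look like (add_contrast() and its arguments are hypothetical at this point):

# Hypothetical usage sketch; add_contrast() does not exist yet.
model = Model(data)
model.add_y('rt')
model.add_term('condition')
# e.g., sum-to-zero coding, or a user-supplied contrast matrix
model.add_contrast('condition', type='sum')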

When adding random slopes for categorical X, drop_first if random intercept found

Users will expect a random specification like random=['1|subj','X|subj'] to work, because in most other software the random X slopes are added as K-1 dummies (K being the number of levels of X) rather than the full set of K dummies, one for each cell mean. Our current default for X|subj is to add K random cell means for each subject, but if the user has also specified a random intercept, then this is an unidentifiable model, because it attempts to estimate K+1 means for each subject, but each subject only has K means.

A reasonable solution is the following: when encountering a term like X|subj, check to see if the user also specified a random intercept for subj, i.e., 1|subj. If yes, then use the drop_first parameterization for X|subj, so that we add K-1 random dummies for each subject. If no, then use the cell means parameterization where we add K random cell means for each subject. This way we always add K parameters to the model.
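
A rough sketch of that rule (the helper below is illustrative, not the actual implementation):

# Illustrative sketch of the proposed rule for a term like 'X|subj'.
def add_categorical_random_slopes(model, term):
    predictor, factor = term.split('|')
    # assuming the random intercept term is registered under a key like '1|subj'
    if '1|%s' % factor in model.terms:
        # random intercept already present: add K-1 random dummies per subject
        model.add_term(predictor, split_by=factor, random=True, drop_first=True)
    else:
        # no random intercept: add K random cell means per subject
        model.add_term(predictor, split_by=factor, random=True, drop_first=False)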

Calling `add_formula` after `add_y` erases the y variable

For example, this works fine:

model.add_formula('condition + gender', random='condition|subject')
model.add_y('rt')

But reversing the order of the terms does not work:

model.add_y('rt')
model.add_formula('condition + gender', random='condition|subject')

After calling add_formula, the y attribute of the model disappears, so attempting to build the model fails with an error.

Results compared to Stan

Hi there,
Great work! I am wondering if you have compared the model estimation result with the output from lme4?

A while ago I tried to compare different approaches to fitting a simple linear mixed model (a random-intercept model with subject as the random effect). While the results using Stan (via the brms package in R) are almost identical to the output from lme4, I am surprised that the estimates from PyMC3 are always a little bit off.

You can have a look at my attempt here.

Implicitly add random intercept when continuous random slopes added to model

Currently adding a random term like X1|subject, where X1 is a continuous predictor, will add random X1 slopes across subjects, but will not add random subject intercepts. But users will likely expect the intercept to be added. Also, intercepts are implicitly added in the fixed part of the model, so maybe we want to be consistent.

One issue to watch out for when adding this functionality is to make sure that random intercepts for a factor are never added twice. For example, if a previous term X1|subject implicitly added random subject intercepts, then a subsequent term X2|subject should NOT try to add random subject intercepts a second time.
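
A minimal sketch of the bookkeeping (the helper names are hypothetical):

# Hypothetical sketch: add the random intercept only the first time a factor appears.
def add_continuous_random_slopes(model, predictor, factor):
    if not model.has_random_intercept(factor):   # hypothetical helper
        model.add_term(factor, random=True)      # implicit random intercepts
    model.add_term(predictor, split_by=factor, random=True)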

Name change

This package needs a new name that doesn't imply any commitment to either (a) psychological data or (b) PyMC. Down the road, we may want to add other back-ends (e.g., Stan), so we should be maximally agnostic.

Improved ModelResults.summary()

Currently ModelResults.summary() just wraps PyMC3's summary() method. It would be good to return basic information about the model--number of terms, number of columns, description of priors used, number of samples run, collinearity diagnostics from the term _setup() calls, etc.

We may also want to suppress (or truncate) the reporting of all random effect parameter estimates by default, since they can spew a lot of output at the user.

Add support for model comparison

PyMC3 has support for several model comparison methods (DIC, leave-one-out, etc.). We should wrap the existing functionality to allow comparison of Model objects.
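
PyMC3 already exposes waic, loo, and dic (they appear in the dir(pymc3) listing quoted in a later issue), so a thin wrapper could look roughly like this (the compare() signature and the .trace/.model attributes on results are assumptions):

import pymc3 as pm

# Sketch of a wrapper over existing PyMC3 functionality; not a final API.
def compare(results_a, results_b, method='waic'):
    ic = pm.waic if method == 'waic' else pm.loo
    return (ic(results_a.trace, results_a.model),
            ic(results_b.trace, results_b.model))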

Plots!

Yes, those things. We need some of them.

Add convenient ways of specifying default priors on a model-wide level

Users can specify what prior to use for each term, but it would be good to force some explicit choice of a default prior during model construction. One way to do this would be to prevent build() from executing unless defaults have been explicitly set. Something like this would work:

model = Model(data)
# could be 'narrow', 'medium', 'uniform', etc.,
# or alternatively an actual distribution specification to use everywhere.
model.set_defaults('wide')
model.add_term(...)   # uses the default set above

The key point is to force a user to take minimal responsibility for the priors somewhere in the model specification process (though the specification doesn't have to be made explicit in terms of distributions and parameters).

Implement a backend-independent representation of results

Following the Model/Backend separation, we should create a ModelResult class that stores a backend-independent representation of the sampler results. For now this can be more or less a container for PyMC3 traces, as there aren't any other backends.
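
A minimal sketch of such a container (attribute and method names are placeholders):

import pymc3 as pm

# Minimal sketch of a backend-independent results container.
class ModelResults(object):
    def __init__(self, model, trace):
        self.model = model    # the bambi Model that was fitted
        self.trace = trace    # for now, just the PyMC3 MultiTrace

    def summary(self, burn_in=0):
        # delegate to PyMC3 for the time being
        return pm.summary(self.trace[burn_in:])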

Call to missing pymc3 attribute (dot)

Thank you for the wonderful package. I'm running into the following issue though: model.fit() complains that there is a missing attribute in pymc3 (dot):

data = pd.DataFrame({ 'one' : pd.Series([5, 4, 3, 1]), 'two' : pd.Series([1, 2, 3, 4]) })
model = bambi.Model(data)
modelFitted = model.fit('one ~ two', samples=2000, njobs=2)
modelFitted.plot(burn_in=50)

AttributeError Traceback (most recent call last)
in ()
1 model = bambi.Model(df)
----> 2 modelFitted = model.fit('st ~ congr:fam:cued', samples=2000, njobs=2)
3 modelFitted.plot(burn_in=50)

/Library/Python/2.7/site-packages/bambi/models.pyc in fit(self, fixed, random, priors, family, link, run, categorical, **kwargs)
218 warnings.warn("Current Bayesian model has not been built yet; "
219 "building it first before sampling begins.")
--> 220 self.build()
221 return self.backend.run(**kwargs)
222

/Library/Python/2.7/site-packages/bambi/models.pyc in build(self)
172 scaler.scale(t)
173
--> 174 self.backend.build(self)
175 self.built = True
176

/Library/Python/2.7/site-packages/bambi/backends.pyc in build(self, spec, reset)
109 coef = self._build_dist(prefix + label, dist_name,
110 shape=n_cols, **dist_args)
--> 111 self.mu += pm.dot(data, coef)[:, None]
112
113 y = spec.y.data

AttributeError: 'module' object has no attribute 'dot'

All the dependencies for pymc3 seem to be satisfied (I went through pymc3's requirements.txt to make sure), yet:

print dir(pymc3)

['AR1', 'ArrayOrdering', 'Bernoulli', 'Beta', 'BetaBinomial', 'BinaryGibbsMetropolis', 'BinaryMetropolis', 'Binomial', 'Bound', 'CallableTensor', 'Categorical', 'Cauchy', 'CauchyProposal', 'ChiSquared', 'CompoundStep', 'Constant', 'ConstantDist', 'Continuous', 'DensityDist', 'Deterministic', 'DictToArrayBijection', 'DictToVarBijection', 'Dirichlet', 'Discrete', 'DiscreteUniform', 'Distribution', 'ElemwiseCategorical', 'ExGaussian', 'Exponential', 'Factor', 'Flat', 'GARCH11', 'Gamma', 'GaussianRandomWalk', 'Geometric', 'HalfCauchy', 'HalfNormal', 'HalfStudentT', 'HamiltonianMC', 'InverseGamma', 'LKJCorr', 'Laplace', 'LaplaceProposal', 'Lognormal', 'Metropolis', 'Model', 'Multinomial', 'MultivariateNormalProposal', 'MvNormal', 'MvStudentT', 'NUTS', 'NegativeBinomial', 'NoDistribution', 'Normal', 'NormalProposal', 'Pareto', 'Point', 'Poisson', 'PoissonProposal', 'Potential', 'SkewNormal', 'Slice', 'StudentT', 'StudentTpos', 'TensorType', 'Uniform', 'VonMises', 'Wald', 'Weibull', 'Wishart', 'WishartBartlett', 'ZeroInflatedNegativeBinomial', 'ZeroInflatedPoisson', 'builtins', 'doc', 'file', 'name', 'package', 'path', 'version', '_log', 'advi', 'advi_minibatch', 'approx_hessian', 'arraystep', 'autocorr', 'autocorrplot', 'autocov', 'backends', 'blocking', 'bool_types', 'bpic', 'compilef', 'complex_types', 'compound', 'cont_inputs', 'continuous_types', 'data', 'debug', 'default_type', 'df_summary', 'diagnostics', 'dic', 'discrete_types', 'distributions', 'effective_n', 'eval_univariate', 'fastfn', 'find_MAP', 'find_hessian', 'float_types', 'fn', 'forestplot', 'gelman_rubin', 'get_data_file', 'geweke', 'gibbs', 'glm', 'gradient', 'guess_scaling', 'handler', 'hessian', 'hessian_diag', 'hmc', 'hpd', 'inputvars', 'int_types', 'invlogit', 'iter_sample', 'jacobian', 'join_nonshared_inputs', 'kde2plot', 'kdeplot', 'logging', 'logit', 'logsumexp', 'loo', 'make_shared_replacements', 'math', 'mc_error', 'memoize', 'metropolis', 'model', 'modelcontext', 'np', 'nuts', 'plot_posterior', 'plots', 'quadpotential', 'quantiles', 'sample', 'sample_ppc', 'sample_vp', 'sampling', 'scaling', 'slicer', 'starting', 'stats', 'step_methods', 'summary', 'test', 'tests', 'theanof', 'trace_cov', 'trace_to_dataframe', 'traceplot', 'tuning', 'typefilter', 'variational', 'vartypes', 'waic']

Thanks in advance for your help. My platform is Mac OS 10.10.5, Python 2.7.
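
For what it's worth, newer PyMC3 releases keep the tensor helpers in a math submodule (note 'math' in the dir() output above), so the failing backend line would presumably become something like the following; this is a guess at the fix, not a confirmed patch:

# backends.py, around line 111 -- pm.dot is no longer a top-level attribute
self.mu += pm.math.dot(data, coef)[:, None]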

Rename level names when returning categorical variables

Patsy will rename categorical variables sequentially, but it's helpful to return results to the user with the original level values for interpretability's sake. By default, we would ideally back-project extracted column names to the original level names in the data passed in. This may or may not be easily feasible given the current implementation, but we should take a look.

Throw error when symbols like + given in random terms

The lme4 package in R supports random terms like (X1+X2|subject), which it interprets as (X1|subject) + (X2|subject), so users will likely assume the same behavior holds in bambi. But it doesn't. It's not clear whether we WANT to support such syntax, so the easiest solution is just to search random-term strings for unexpected symbols like '+' and throw an error if any are found.
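
A short sketch of that validation (the regex and error message are illustrative):

import re

# Illustrative sketch: reject random-term strings containing operators we don't parse.
def validate_random_term(term):
    if re.search(r'[-+*/:]', term):
        raise ValueError("Unsupported operator in random term '%s'; "
                         "write separate terms like 'X1|subj' instead." % term)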

Deal with NA values

Currently, it seems that if you have NA / NaN (dependent variable) cells, you get an error because the LHS and RHS aren't the same length. Would it be possible to do the NA filter before splitting into LHS and RHS?
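
A sketch of the proposed ordering (the column list here is illustrative):

# Sketch: drop rows with missing values in any modeled column *before*
# building the design matrices, so the LHS and RHS stay aligned.
cols = ['rt', 'condition', 'gender', 'subject']   # columns used in the model
clean = data.dropna(subset=cols)
model = Model(clean)
model.fit('rt ~ condition + gender', random='condition|subject')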

'backend' as a parameter of Model.build, not Model.__init__

Basically right now if we want to compare results between pymc3 and stan for the same model and data, we do something like:

pymc3_model = Model(data, backend='pymc3')
pymc3_fitted = pymc3_model.fit('y ~ x')
stan_model = Model(data, backend='stan')
stan_fitted = stan_model.fit('y ~ x')

I think something like this would be neater:

model = Model(data)
pymc3_fitted = model.fit('y ~ x', backend='pymc3')
stan_fitted = model.fit(backend='stan')

Where the argument to 'backend' here gets passed through fit() to build(). I think this would make more sense because then the model specification is totally abstracted away from the sampling method. (Well, in the second example I specified the model and the pymc3 backend in the same call, so the abstraction is not as obvious, but to be even cleaner we could first specify the model using add_formula() and then make two separate calls to fit().)

Add ability to update priors after model specification

At the moment it's cumbersome to specify priors on individual terms (at least via the one-shot fitting interface), and in some cases the user might not even know the name of the term constructed by patsy in advance. We should add a set_prior method to Model that allows the user to set priors on existing terms by name. It could also have arguments like fixed, random, etc., allowing replacement of priors for the whole class of fixed or random terms.
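
A sketch of how the proposed method might be used (set_prior() and the Prior spelling shown here are part of the proposal, not the current API):

# Proposed usage sketch; set_prior() does not exist yet.
model = Model(data)
model.add_formula('condition + gender', random='condition|subject')
model.add_y('rt')
model.set_prior('condition', Prior('Normal', mu=0, sd=1))   # by term name
model.set_prior(fixed=True, prior='narrow')                 # or for all fixed terms at once
model.build()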

Allow only random effects

While unlikely, a user could conceivably want to include only random effects in a model. Currently the fixed argument is mandatory in add_formula; it should be made optional.

Add R-like t-test interface as a special case

To facilitate adoption and familiarize naive users with the package, we should add a package-level ttest() method that just wraps add_term() or fit() calls, and has the same arguments as R's t.test function.

Removable discontinuity in slope = f(corr)

Our default priors are based in part on derivatives of a function that maps the partial correlation to the implied slope. But this function has a removable discontinuity at 0 (due to the sign function), so the derivative does not exist at exactly the point where we almost always want to evaluate it. Right now we get around this by evaluating the derivative at a point very close to 0. This mostly works well, although we have to evaluate it a bit further from 0 for increasingly higher-order approximations or the results get weird. I worked out what seem to be good distances from 0 for different levels of approximation by trial and error, but this would all be a lot nicer if we could deal with the removable discontinuity directly and then just take the derivative at 0. Some initial attempts to do this using limit() in Sympy didn't seem to work, but we should return to this issue eventually and see if we can make it work.

Specify burn-in at ModelResults class level

Currently, burn_in is strictly a method-level parameter (in .plot and .summary), but for convenience and consistency, a user should be able to set it on the ModelResults instance—preferably propagated from the Model.run() call. Then the burn_in argument can default to None instead of 0, and check for a value in the instance before falling back on 0.
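
The intended precedence is roughly method argument > instance value > 0, as in this sketch:

# Sketch of the fallback logic (instance_burn_in would come from Model.run()).
def effective_burn_in(arg_burn_in, instance_burn_in):
    if arg_burn_in is not None:        # explicit method-level argument wins
        return arg_burn_in
    return instance_burn_in if instance_burn_in is not None else 0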

No easy way to override default prior args for Y?

It looks like there's no easy mechanism for overriding the default prior arguments for the outcome Y. One can easily change the family through the family parameter of add_y(), but if one wanted to use, say, a Normal family with the prior SD set to something other than the default, right now the only mechanism for doing so is to pass an argument to default_priors while initializing the model. It would be easier and more intuitive to do this in the call to add_y(), so maybe we want to give add_y() an optional prior parameter that overrides the default.

Allow priors on fixed effects to be given on standardized scale

Two candidates for alternative ways to set priors on the fixed regression coefficients:

  • Normal or Cauchy priors on the standardized regression coefficients
  • Beta priors (scaled/shifted to be in [-1,1] range) on the partial correlation coefficients

Priors on the partial correlations would be a little nicer, but both would be viable solutions to the scaling problem (i.e., the same prior on a raw regression coefficient would be more or less informative depending on the scales of X and Y). Setting priors on the partial correlations could perhaps even be the default, unless the user elects to put priors directly on the regression coefficients.
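
For reference, moving between the two scales is just a rescaling: a Normal prior with SD s on the standardized slope for X implies a Normal prior with SD s * sd(Y) / sd(X) on the raw slope, as in this sketch:

import numpy as np

# Sketch: convert a prior SD on the standardized scale to the raw scale.
def raw_scale_sd(sd_standardized, y, x):
    return sd_standardized * np.std(y) / np.std(x)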

In add_term interface, adding random slopes overwrites corresponding fixed slope

Here is a random slopes model specified both via the formula interface and the add_term interface:

# build model using formula
modelA = Model(crossed_data)
modelA.fit('Y ~ continuous', random=['continuous|subj'], run=False)
modelA.build()

# build model using add_term
modelB = Model(crossed_data)
modelB.add_y('Y')
modelB.add_intercept()
modelB.add_term('continuous', random=False)
modelB.add_term('subj', random=True)
modelB.add_term('continuous', split_by='subj', random=True)
modelB.build()

The model specified via the formula interface ends up with the expected terms:

modelA.terms
# OrderedDict([('Intercept', <bambi.models.Term at 0x196889c50>),
#              ('continuous', <bambi.models.Term at 0x1968864a8>),
#              ('subj', <bambi.models.RandomTerm at 0x19683a908>),
#              ('continuous|subj', <bambi.models.RandomTerm at 0x19686f438>)])

But the model specified via add_term, which I believe should be the equivalent model, has lost the fixed slope:

modelB.terms
# OrderedDict([('Intercept', <bambi.models.Term at 0x19686f0b8>),
#              ('continuous', <bambi.models.RandomTerm at 0x196950358>),
#              ('subj', <bambi.models.RandomTerm at 0x196950ef0>)])

The random term labels don't match, as you've noted elsewhere, but that's not really a big deal because the terms themselves are the same. Anyway, this problem is likely due to the fact that the label for the random slopes coincides with the label for the corresponding fixed slope.

For categorical outcomes, tell user which event is being modeled

When the outcome is categorical, such as in logistic regression, and the outcome consists of, say, strings such as 'yes' and 'no', bambi makes some decision about which event to model (i.e., whether the estimated probabilities are P(y='yes') or P(y='no')), but this decision is silent. We should print a standard message with the model output (or even during fitting?) that lets users know which event is being modeled. If the outcome is an array of 0s and 1s, then one can probably assume that it is modeling P(y=1) (and in fact I've verified that this is true in at least one case), but even here it would be nice to be assured.

Add tests for the full range of models we support

Adding tests for different models should be a high priority. I don't think we need to check for numerical correctness of output just yet (though we'll get there), but we do want to make sure the number of parameters we get back for each term is right, etc. We should add at least one test for every major permutation of model we can support (e.g., fixed effects alone; fixed effects with one random intercept term; crossed random effects; etc.).

I think these can probably be specified with the formula syntax for now, though a better approach would ultimately be to specify the same model using both the high- and low-level APIs and then check for equivalency of results (in addition to checking that the trace looks right).

Plot or summarize priors

Since the priors are now adapted to the data, we should have some way of showing the user what the parameters actually were. Either a plot_priors() method comparable to the kdeplots produced by traceplot() or some kind of summarize_priors() that produces a table of distribution names and parameters would do the trick.

Test model estimates against ground truth

A pretty big hole in our current testing approach is that right now all of our tests are just testing that models compile, run, and return the right variables. We should probably pick out a limited subset of models and test that the model estimates are close to the ground truth (where the "ground truth" comes from either analytical solutions or other packages that converge with one another). This will also be helpful as we add new backends and/or experiment with different ways of parameterizing the same models (cf. #62).

Replacement for pymc3.glm?

This looks really cool.

It seems like this by far supersedes the pymc3.glm functionality. That maybe should have been a separate package from the beginning. Do you think that's accurate? In that case, I wonder if I should get rid of pymc3.glm and instead refer to this package. Any thoughts?

Extend support from LMMs to GLMMs

Specifically, add support for the following distributions of Y (beyond Gaussian):

  • Binomial
  • Poisson

And for the following link functions (beyond the identity function):

  • Logit
  • Probit
  • Log

Add basic support for formula-style specification

In principle, it would be easy to support formula-style model specification via either PyMC3's existing glm interface, or by way of patsy, which exposes a DesignMatrix class (alongside the returned numpy array) that tracks all of the attributes we need in order to construct Term instances.

The main drawback of the former approach is that PyMC3's glm module has no concept of terms that could span multiple columns: every column in the DataFrame produced by patsy has to be assigned its prior individually by name (unless the default can be applied everywhere), which is untenable for our purposes.

The problem with the latter approach is that patsy doesn't support specification of random effects, which are pretty critical here. I'm not sure what to do about this. We could (a) write our own parsing engine (this does not seem like a good idea to me); (b) support only the operators patsy supports, and force users to add random terms using the more explicit add_term() syntax; or (c) try some hacky thing where we initially find-and-replace on the | operator, then rely on patsy to do the heavy lifting, and then finally pass a random=True argument to the Term initializer. I lean towards (c), though I suspect it will get messy quickly once we start thinking about nested or interacting factors. Thoughts, @jake-westfall?
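
A rough sketch of option (c), just to make the idea concrete (the splitting is deliberately naive and ignores parentheses):

import patsy

# Illustrative sketch: peel off '|' terms, let patsy handle the rest.
def parse_formula(fixed_formula, data):
    parts = [p.strip() for p in fixed_formula.split('+')]    # naive split
    random_parts = [p for p in parts if '|' in p]             # e.g. 'X|subj'
    fixed_part = ' + '.join(p for p in parts if '|' not in p)
    fixed_dm = patsy.dmatrix(fixed_part, data, return_type='dataframe')
    return fixed_dm, random_parts   # random parts become Terms with random=True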

family = 'binomial' error

Hi there,
I am getting a weird error after upgrading to 0.0.4.
I am modelling the accuracy of the behavioural response ([0, 1]) using:

acc_fitted = ACCmodel.fit('ACC ~ ViewCond*Race*Orit',random='1|Sbj',
                          family='binomial', samples=2000, njobs=2)

Which was fine before the upgrade but now I am getting the following error:

Traceback (most recent call last):

  File "<ipython-input-34-c6e18d364f43>", line 3, in <module>
    family='binomial', samples=2000, njobs=2)

  File "/usr/local/lib/python3.4/dist-packages/bambi/models.py", line 254, in fit
    self.build()

  File "/usr/local/lib/python3.4/dist-packages/bambi/models.py", line 206, in build
    self.y.name, str(self.data[self.y.name][event])))

  File "/usr/local/lib/python3.4/dist-packages/pandas/core/series.py", line 601, in __getitem__
    result = self.index.get_value(self, key)

  File "/usr/local/lib/python3.4/dist-packages/pandas/indexes/base.py", line 2139, in get_value
    tz=getattr(series.dtype, 'tz', None))

  File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas/index.c:3338)

  File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas/index.c:3041)

  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)

  File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8141)

  File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8085)

KeyError: 1

Any suggestion?

Intercept default prior has SD=0 when all predictors are centered

This is not a bug exactly, but an unanticipated consequence of how we determine the default prior for the intercept (the logic is explained in this notebook).

It actually sort of makes sense when you think about it: in an OLS context, if all the predictors are centered, then we know that the estimate for the intercept will be the sample mean of Y. Consistent with this, our defaults assign a prior with mean = mean(Y) and sd = 0 when all the predictors are centered. But this is not really desirable in practice because, among other things, it causes computational problems when trying to run the model.

One way of viewing this is that the problem essentially arises because we treat mean(Y) as a fixed, known quantity, when in fact Y is a function of the random data, so there ought to be uncertainty about mean(Y) even if all the predictors are centered. We know that the OLS estimate of the intercept will equal the sample mean of Y, but this obviously does not imply that we are certain about the population mean of Y, which is what our default method kind of assumes. So a simple fix would be to add var(mean(Y)) or var(Y) to the variance of the intercept prior -- the former option is more technically correct, but could lead to a really narrow prior when there's a lot of data, which could conceivably have other negative consequences, so the latter option could possibly work better in practice, despite being perhaps less principled.
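
In code, the proposed adjustment amounts to something like this sketch (y and base_sd are placeholders; base_sd stands for whatever the current default logic produces, which is 0 when all predictors are centered):

import numpy as np

# Sketch of the proposed fix: widen the intercept prior by the uncertainty in mean(Y).
extra_var = np.var(y) / len(y)      # var(mean(Y)) -- the more principled option
# extra_var = np.var(y)             # or simply var(Y), likely safer in practice
intercept_prior_sd = np.sqrt(base_sd ** 2 + extra_var)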

Drop statsmodels dependency?

It turns out that pd.stats.api.ols internally calls statsmodels, so we haven't actually eliminated the statsmodels dependency. We can either switch to np.linalg.lstsq, or just explicitly call statsmodels. The pandas stats API is actually deprecated, so we'll need to change either way. I'd prefer to avoid requiring statsmodels, since we're only using it for least-squares fitting, but if you prefer its interface to numpy's for the stuff in Model.build(), @jake-westfall, that's fine.
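
The numpy replacement is essentially a one-liner (X is the fixed-effects design matrix, including an intercept column, and y the outcome):

import numpy as np

# Ordinary least squares without statsmodels or the deprecated pandas stats API.
coefs, residuals, rank, singular_values = np.linalg.lstsq(X, y)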

Modified default priors for non-normal models

The default priors make good sense for normal models, but are not designed to work out-of-the-box with non-normal models. There are two complications that I can see: defining SD(Y) and defining dm_statistics['r2_y'], which is the R^2 from regressing Y on all the other Xs.

For SD(Y), the value we use should be something that makes sense on the link scale, not necessarily on the observed scale. Commit 98a4a45 introduces an interim solution that handles the binomial/logit and binomial/probit cases (for when the latter is implemented). In those models the variability of Y on the link scale is not identifiable over and above the slopes, so I fix the constant to a value based on setting the variance of the latent logistic or normal variable to 1 (see section 3.5 here for a similar approach). All other family/link combinations are currently just left as-is, i.e., using the observed SD(Y), so that all the models will still run. And this default should be fine when family=normal or link=identity, but for other cases we should think about what sensible defaults are.

For r2_y, this currently uses the classical R^2, but this only really makes sense when Y is normal. For example, if Y is binomial, then really this should be based on a logistic regression of Y on all the other Xs. One thing that could work is to just replace this with a pseudo-R^2 statistic in the non-normal case. But as of commit 98a4a45 we are still just using the classical R^2.

I think the general principle we want to shoot for is that the resulting interpretation of the prior SD would always be in terms of partial correlations between the predictors and the latent outcome in the case of GLM(M)s. I think for binomial/logit models this is accomplished by the current SD(Y) fix + replacing r2_y with a pseudo-R^2. But we should think about how to accomplish this for other family/link combinations.

Add informative message when prior scaling fails due to non-identifiable fixed effects

When a model with default priors and non-identifiable fixed effects is created (e.g., by calling something like add_term('condition', drop_first=False) when an intercept is already in the model), the prior scaling approach currently implemented in PriorScaler sets the random factor hyperprior variances to inf, and the sampler consequently refuses to start. We should either check for this scenario and substitute some large but sane value, or raise an exception with a more informative message rather than waiting for Theano to complain.

Incremental sampling

We should add the ability to sample incrementally (i.e., to restart the sampler with the last sample as the new starting point, and append new samples to the existing MultiTrace).
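
A sketch of the intended usage (the continue_from argument is hypothetical; the point is just that new draws get appended to the existing trace rather than starting from scratch):

# Hypothetical usage sketch; the incremental API does not exist yet.
results = model.fit('y ~ x', samples=1000)
# ... inspect the trace, decide more samples are needed ...
results = model.fit(samples=1000, continue_from=results)   # restarts at the last sample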

Add ability to add interactions

Currently there's no way to add interaction terms to models; users have to take the products of columns by hand (which is much too cumbersome in the case of categoricals). We should extend the API to handle this via, e.g., an add_interaction() method that takes a list of names of existing Terms as its main argument. It should also, by default, create terms for main effects if they don't already exist (though this will only make sense if the user wants the same priors applied to all generated terms).
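
A sketch of the proposed call (the signature is hypothetical):

# Hypothetical usage sketch for the proposed method.
model.add_term('condition')
model.add_term('gender')
model.add_interaction(['condition', 'gender'])   # adds condition:gender; would also
                                                 # create missing main effects by default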

For categorical X, internally change random terms from `X|subj` to `subj|X`

For adding random slopes for categorical predictors, the usual syntax (e.g., in lme4) is X|subj, where X is the predictor and subj is the random factor. Right now this does not give the expected behavior, as it does not add separate random subject terms for each level of X. A simple fix is to internally change this to subj|X. The key is to make this change internal, because in the usual syntax that users are familiar with, subj|X doesn't really make sense. So we should let users write X|subj but internally interpret it as subj|X.

Docstrings

The code is pretty poorly documented at the moment. We need to make sure every public method has a detailed docstring. Once that's done, we can think about adding Sphinx support.

Summary output for two-level categorical variables

The model summary / coefficients for two-level categorical variables (whether a str or Categorical in pandas) display the term name but not the level / contrast generated by patsy. This makes interpreting the output somewhat more difficult.

Inspecting the model structure a bit, I see that model.terms['binary_var'].categorical == False. Binary variables do have an obvious treatment coding, but I always have trouble determining/remembering the reference level in patsy. (According to the documentation it's the 'first' level, but does that mean the first level to appear in the data frame? That can change. In R, 'first' is determined based on lexicographic ordering.)

I haven't yet tinkered with a many-leveled variable to see what happens there.

Explore alternative parameterization of random effects

Inspired by an interesting and (I think) important blog post by @twiecki. We have definitely seen this sampler behavior in several of our models.

A big implication could be that our current (traditional) parameterization of random effects may tend to overestimate the variances of random effects whose true variances are near 0. This would be particularly important for random stimulus effects, which in very many cases are less variable than random subject effects (in fact, random stimulus slopes very often have near-0 variance). This is further exacerbated by the unfortunate fact that much of psychology/neuroscience tends to use quite small stimulus samples, so that even relatively small overestimates in the stimulus variance can result in a non-trivial decrease in the precision of posteriors for the effects of interest.

It is also possible that this alternative parameterization could result in faster sampling, since presumably fewer proposals would be rejected while the sampler is exploring the low-random-effect-variance part of the parameter space... I think.

Anyway, we should do some experimentation and systematic comparison between parameterizations.
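
For reference, the non-centered ("offset") version of, say, a random subject intercept looks like this in PyMC3 (a minimal sketch; n_subjects and the priors are illustrative):

import pymc3 as pm

# Sketch: non-centered parameterization of random subject intercepts.
with pm.Model():
    sd_subj = pm.HalfCauchy('sd_subj', beta=1)
    offsets = pm.Normal('subj_offsets', mu=0, sd=1, shape=n_subjects)
    subj_intercepts = pm.Deterministic('subj_intercepts', offsets * sd_subj)
    # subj_intercepts[subject_idx] then enters the linear predictor as usual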

ESCS_multiple_regression fails

The stack trace is as follows:

>>> fitted = model.fit('drugs ~ o + c + e + a + n', samples=2000, njobs=2)
$HOME/Code/bambi/bambi/models.py:227: UserWarning: Current Bayesian model has not been built yet; building it first before sampling begins.
  warnings.warn("Current Bayesian model has not been built yet; "


TypeError                                 Traceback (most recent call last)
<ipython-input-5-02e82ee54338> in <module>()
      1 model = bmb.Model(data)
----> 2 fitted = model.fit('drugs ~ o + c + e + a + n', samples=2000, njobs=2)

$HOME/Code/bambi/bambi/models.pyc in fit(self, fixed, random, priors, family, link, run, categorical, **kwargs)
    227                 warnings.warn("Current Bayesian model has not been built yet; "
    228                               "building it first before sampling begins.")
--> 229                 self.build()
    230             return self.backend.run(**kwargs)
    231 

$HOME/Code/bambi/bambi/models.pyc in build(self)
    181                 self.y.name, str(self.data[self.y.name][event])))
    182 
--> 183         self.backend.build(self)
    184         self.built = True
    185 

$HOME/Code/bambi/bambi/backends.pyc in build(self, spec, reset)
    119             y_prior.args['observed'] = y
    120             y_like = self._build_dist(
--> 121                 spec.y.name, y_prior.name, **y_prior.args)
    122 
    123             self.spec = spec

$HOME/Code/bambi/bambi/backends.pyc in _build_dist(self, label, dist, **kwargs)
     70             return v
     71 
---> 72         kwargs = {k: _expand_args(k, v, label) for (k, v) in kwargs.items()}
     73         return dist(label, **kwargs)
     74 

$HOME/Code/bambi/bambi/backends.pyc in <dictcomp>((k, v))
     70             return v
     71 
---> 72         kwargs = {k: _expand_args(k, v, label) for (k, v) in kwargs.items()}
     73         return dist(label, **kwargs)
     74 

$HOME/Code/bambi/bambi/backends.pyc in _expand_args(k, v, label)
     67             if isinstance(v, Prior):
     68                 label = '%s_%s' % (label, k)
---> 69                 return self._build_dist(label, v.name, **v.args)
     70             return v
     71 

$HOME/Code/bambi/bambi/backends.pyc in _build_dist(self, label, dist, **kwargs)
     71 
     72         kwargs = {k: _expand_args(k, v, label) for (k, v) in kwargs.items()}
---> 73         return dist(label, **kwargs)
     74 
     75     def build(self, spec, reset=True):

/usr/local/lib/python2.7/site-packages/pymc3/distributions/distribution.pyc in __new__(cls, name, *args, **kwargs)
     27             return object.__new__(cls)  # for pickle
     28         else:
---> 29             raise TypeError("needed name or None but got: " + name)
     30 
     31     def __getnewargs__(self):

I got the same error for a linear mixed model. The logistic regression example still works.

Better specification of random effects

Currently random effects must be specified by passing a list that only accepts the pipe operator, and none of the other standard formula operators (e.g., 'subject|group' is fine; '(1+subject)|group' breaks). In the long term we should eventually handle all the operators accepted by lme4, but a shorter-term solution is to check that no invalid operators are passed, and raise an error message if any are detected (e.g., '1+subject|group' will raise an exception). This ensures that people don't run into unexpected behavior where the model runs without complaint, but fits the wrong model.
