askurz / statistical_rethinking_with_brms_ggplot2_and_the_tidyverse Goto Github PK

The bookdown version lives here: https://bookdown.org/content/3890

License: GNU General Public License v3.0

CSS 0.78% JavaScript 0.56% HTML 96.83% TeX 1.83%

bayesian-inference brms tidyverse rethinking regression

statistical_rethinking_with_brms_ggplot2_and_the_tidyverse's Introduction

Statistical rethinking with brms, ggplot2, and the tidyverse

I love McElreath's Statistical rethinking text. However, I've come to prefer using Bürkner’s brms package when doing Bayesian regression in R. It's just spectacular. I also prefer plotting with Wickham's ggplot2, and using tidyverse-style syntax (which you might learn about here or here).

So, this project is an attempt to reexpress the code in McElreath's textbook. His models are re-fit in brms, plots are reproduced or reimagined with ggplot2, and the general data wrangling code now predominantly follows the tidyverse style.

The current version, 1.3.0, is available in HTML only. However, version 1.0.1 is still available as a PDF version.

This repository contains the code and text behind the Statistical rethinking with brms, ggplot2, and the tidyverse project.

The project was stitched together using Yihui Xie's bookdown package.

statistical_rethinking_with_brms_ggplot2_and_the_tidyverse's People

Contributors

Stargazers

Watchers

Forkers

nemochina2008 mikedecr ogletrees dizyd jgmac1106 deerluluolivia ytlogos favstats humanfactors sidmehta1975 federicoandreis anhnguyendepocen elsherbini eipi10 iron0012 hammer dwinkler1 amizrob mcmaurer gavin-ng keesterbrugge ankj snowdj htnani cogneuro rpruim jmgirard henrikeckermann mathematiguy xxxpipxxx mashedpoteto npp97 paul-yuchao-dong maxgav13 petermacp zhujiedong gwd999 maj-biostat lewangecon robindang0573 wolski rmadillo hrgentry federicomarini blacng guhjy nliced zhuang4270 jonspring stevenworthington emmanuel-r8 mingrisch m-macaskill jameshwade geoclick yhw1234 karthy257 primulalhd askebredahl tpfeeney lejarx wjakethompson stzhao daaronr josemreis jerrywho jwingnut clrafaelr davan690 reyntjesr wsteenbeek perlatex srravula1 dallak newtpm pranavesh-panakkal haroon123 arrendi nicholaskarlson delmurphy wangdata arvindvenkatadri margosolo r-tutorials emmamills72 owain-s artlesshao sielerjm tpetric7 l-gorman

statistical_rethinking_with_brms_ggplot2_and_the_tidyverse's Issues

Section 7.5 typo: water.c -> water_c

I am really enjoying your work! It's such a nice source to learn both tidyverse syntax and brms. It's especially inspirational to see another psychologist's work here (I'm majoring in counseling psychology, btw).

Just a minor typo to report. In the last part of the section 7.5, 'water.c' needs to be 'water_c,' which will generate the intended plot.

Chapter 3 typo - n_tirals -> n_trials

I know nothing about bookdown or I would just make a pull request. In multiple places in the code in chapter 3 there is an n_tirals variable, which should presumably be n_trials.

Figure 7.7.

I can't seem to get the name to work dynamically (e.g., paste("fig", w, sep = "_") returns an error). I would like to code this such that the loop returns three objects fig_-1, fig_0 and fig_1, which I could save or put into the multiplot() function.

Here's my curent attempt:

# loop over values of waterC and plot predictions
shade.seq <- -1:1

for(w in -1:1){
  # defining the subset of the original data
  dt <- d[d$water.c == w, ]
  # defining our new data
  nd <- tibble(water.c = w, shade.c = shade.seq)
  # using our sampling skills, like before
  fit.7.9 <- fitted(b7.9, newdata = nd) %>%
    as_tibble() %>%
    bind_cols(nd)
  
  # specifying our custom plot
  fig <- ggplot() + 
    theme_pander() + 
    geom_ribbon(data = fit.7.9, 
                aes(x = shade.c,
                    ymin = `2.5%ile`,
                    ymax = `97.5%ile`), 
                fill = "#CC79A7", alpha = 1/5) +
    geom_line(data = fit.7.9, aes(x = shade.c, y = Estimate), 
              color = "#CC79A7") +
    geom_point(data = dt, aes(x = shade.c, y = blooms),
               color = "#CC79A7") +
    coord_cartesian(xlim = c(-1, 1), ylim = c(0, 350)) +
    scale_x_continuous(breaks = c(-1, 0, 1)) +
    labs(x = "Shade (centered)", y = "Blooms", 
         title = paste("Water (centered) =", w)) +
    theme(text = element_text(family = "Times"))
  
  # plotting that joint
  plot(fig)
}

Model 14.1.

Not sure how to get the div_est parameters from brms the way McElreath did for his model 14.1. (page 428).

effective samples

Introduce the update the effective samples from brms 2.10.0

11.1.2, check the prose after model 11.1
12.2.5, update the prose after the neff_ratio() block
13.3, update the prose after the b13.6 block and in the plot (i.e., the ylab) that follows

start using `tidyr::crossing()`

replace tibble() %>% expand() with crossing()

3.3.1, sixth code block
4.3.3, first code block
10.2.1, second to last code block
11.3.1, fifth code block

model 10.16

Part of the key to understanding m10.16 is this little sentence at the top of page 325: "Notice that there are no intercepts in the linear model." McElreath used a nonlinear model. It may take the brms nonlinear syntax to fit this with brm().

Section 3.2.2 normalizing constant typo (?)

Thank you very much for this wonderful project! It is very helpful.

One question in section 3.2.2:

Should it be mutate(posterior = (likelihood * prior) / sum(likelihood * prior)) instead, or mutate(posterior = likelihood * prior, posterior = posterior / sum(posterior))?

Exercise 12M3 help

Not an issue per se, so please close if this is not an appropriate venue to ask for help.

I'm having trouble implementing Exercise 12M3 in brms. Here's what I've done using the rethinking code:

 map2stan(
  alist(
    surv ~ dbinom(density, p) ,
    logit(p) <- alpha[tank],
    alpha[tank] ~ dcauchy(m, shape),
    m ~ dnorm(0, 10),
    shape ~ dcauchy(0, 1)
  ), data=chimpanzees )

But I can't figure how to apply alpha[tank] ~ dcauchy(m, shape), in brms. I was wondering if I should use the argument family = binomial("cauchit") within brm but it's not giving similar results. FWIW, I was able to replicate the results in greta with the following code:

a <- normal(0, 1)
sigma <- cauchy(0, 1, truncation = c(0, Inf))
alpha_cauch <- cauchy(a, sigma, dim = max(reedfrog$tank))
p_cauch <- ilogit(alpha_cauch)
distribution(reedfrog$surv) <- binomial(reedfrog$density, p_cauch)

m12greta_cauch <- model(alpha_cauch)
m12m3_cauch_samples <- mcmc(m12greta_cauch)
summary(m12m3_cauch_samples)

Model 13.7.

I’m not proficient enough with Gaussian process models to come up with the brms equivalent of McElreath’s:

m13.7 <- map2stan(
    alist(
        total_tools ~ dpois(lambda),
        log(lambda) <- a + g[society] + bp*logpop,
        g[society] ~ GPL2( Dmat , etasq , rhosq , 0.01 ),
        a ~ dnorm(0,10),
        bp ~ dnorm(0,1),
        etasq ~ dcauchy(0,1),
        rhosq ~ dcauchy(0,1)
    ),
    data=list(
        total_tools=d$total_tools,
        logpop=d$logpop,
        society=d$society,
        Dmat=islandsDistMatrix),
    warmup=2000 , iter=1e4 , chains=4 )

urbnmapr

replace urbnmapr in Chapter 5

Errata

Check McElreath's Errata.

softmax

Add the correction for the first softmax model, b10.16, by integrating the material in the https://github.com/ASKurz/Statistical_Rethinking_with_brms_ggplot2_and_the_tidyverse/blob/master/in_the_works/10.3.1_mn.Rmd file.

embed_url()

In Bonus Section 13.5, embed the actual lecture presentation in the ebook with the following code:

vembedr::embed_url("https://youtu.be/rSQ1XMwO_9A?t=343") %>%
  vembedr::use_align("center")

While you're at it, consider throwing in a visualization of the data, too. Maybe something like this, which is based on the second plot in this blog post.

funding %>% 
  mutate(p = awards / applications) %>% 
  
  ggplot(aes(x = p, y = reorder(discipline, p))) +
  geom_point(aes(color = gender, size = applications)) +
  scale_size_continuous(breaks = NULL) +
  scale_color_manual(NULL, values = c("#A65141", "#80A0C7")) +
  scale_x_continuous("award probability", limits = c(0, .5)) +
  ylab(NULL) +
  theme_pearl_earring(axis.text.y = element_text(hjust = 0),
                      axis.ticks.y = element_blank()) +
  theme(legend.position = c(.9, .8))

Model 14.2

Unsure of how to set up inits in brms to match the way McElreath did for his model 14.2. Here’s McElreath’s code:

dlist <- list(
    div_obs=d$Divorce,
    div_sd=d$Divorce.SE,
    mar_obs=d$Marriage,
    mar_sd=d$Marriage.SE,
    A=d$MedianAgeMarriage )

m14.2 <- map2stan(
    alist(
        div_est ~ dnorm(mu,sigma),
        mu <- a + bA*A + bR*mar_est[i],
        div_obs ~ dnorm(div_est,div_sd),
        mar_obs ~ dnorm(mar_est,mar_sd),
        a ~ dnorm(0,10),
        bA ~ dnorm(0,10),
        bR ~ dnorm(0,10),
        sigma ~ dcauchy(0,2.5)
), data=dlist , start=list(div_est=dlist$div_obs,mar_est=dlist$mar_obs) , WAIC=FALSE , iter=5000 , warmup=1000 , chains=3 , cores=3 , control=list(adapt_delta=0.95) )

Unclear if it’s connected to the issue, above, but the regularization in the brms model b14.2 was much more aggressive than McElreath’s. It’s very apparent when you compare my version of Figure 14.4.b. to that in the text.

expose_functions() does not work

Line 1135 of 11.Rmd, the code:

expose_functions(b11.5, vectorize = TRUE)
Produced error message of:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘get_stancode’ for signature ‘"stanmodel"’
I am using rstan 2.18, which I compiled myself. This might be the problem?

Model 13.3

It’s unclear to me why the varying intercepts varies more greatly for the brms solution than McElreath’s:

m13.3 <- map2stan(
    alist(
        admit ~ dbinom( applications , p ),
        logit(p) <- a_dept[dept_id] +
                    bm_dept[dept_id]*male,
        c(a_dept,bm_dept)[dept_id] ~ dmvnorm2( c(a,bm) , sigma_dept , Rho ),
        a ~ dnorm(0,10),
        bm ~ dnorm(0,1),
        sigma_dept ~ dcauchy(0,2),
        Rho ~ dlkjcorr(2)
), data=d , warmup=1000 , iter=5000 , chains=4 , cores=3 )

It’s really apparent if you compare the output of my coefficients plot to his:

plot( precis(m13.3,pars=c("a_dept","bm_dept"),depth=2) )

Post-treatment bias clarity

In Section 5.3.3, define the h0 variable before the treatment and fungus variables in the d data frame. This should help clarify the temporal sequence. Somewhere in the prose, you might also mention this follows an example from Cochran (1957, Section 2.3; https://doi.org/10.2307/2527916), which was also discussed in Rosenbaum (1984, Section 3.3; https://doi.org/10.2307/2981697).

Figure 10.8.b.

Unsure how to elegantly insert the marginal posterior estimates as dot-and-line plots. I don't believe McElreath shared the code for his version in the text (page 316).

A minor ggplot() code update

Hi,
Thoroughly enjoying working through this book and just wanted to out a very minor thing:
I was getting "Error in zero_range(range): x must be length 1 or 2" in when plotting the 4 steps, 8 steps and 16 steps plots in Ch. 4.1.1. This is fixed by changing, coord_cartesian(xlim = -6:6), to, coord_cartesian(xlim = c(-6,6)), for the three plots.
I know it's a very minor point but just wanted to mention it for keeping code up to date.
Thanks for the great resource!

gp()

Correct the Gaussian process section, 13.4, based on the https://github.com/ASKurz/Statistical_Rethinking_with_brms_ggplot2_and_the_tidyverse/blob/master/in_the_works/13.4_gp_redo.Rmd file.

Section 4.4.3.4 typo

Another minor typo to report.

At Section 4.4.3.4., It says, "With fitted(), it’s quite easy to plot a regression line and its intervals. Just omit the summary = T argument." I believe "summary = T" should be "summary = F".

Thanks!

Model 10.16

Have been unable to figure out the brms equivalent to McElreath’s:

m10.16 <- map(
alist( 
    career ~ dcategorical( softmax(0,s2,s3) ),
    s2 <- b*2,    # linear model for event type 2
    s3 <- b*3,    # linear model for event type 3
    b ~ dnorm(0,5)
), data=list(career=career) )

Error when building the index.Rmd

The error is.
Should the _main.Rmd be part of the repo?

The file _main.Rmd exists. Please delete it if it was automatically generated, or set a different book_filename option in _bookdown.yml.

typos

Intro page: "[It's] flexible, uses reasonably-approachable syntax, has sensible defaults, and offers a vast array of post-processing convenience functions."
Switch out references for Legler & Roback (2019) to Roback & Legler (2021).

request: embed hypothes.is

Embedding hypothes.is for collaborative annotation would be a plus. Of course people with the plugin can do it themselves, but embedding this

encourages people to do this collab annotation, which is super helpful and can help improve the resource
makes this much easier to do from mobile devices, including fancy e-readers

This can be done on bookdowns like I did here and on Quartos like I did here

Yihui offers some discussion of how to add this to bookdowns -- it's not hard. With Quarto it's even easier.

Potential simplifications of some parameter name munging

Love the project and also the recent post on meta-analysis. As I was reading through that post, it occurred to me that you could potentially simplify some of the parameter name munging by using tidybayes::spread_draws, tidybayes::gather_draws, or tidybayes::add_fitted_draws. E.g., this plot from your blog post on meta-analysis:

# We’ll use this custom function to add the grand mean to the group-specific deviations
add_overall_effect <- function(i){
  posterior_samples(b14.6)[, "b_Intercept"] + i
}

# load tidybayes
library(tidybayes)
posterior_samples(b14.6) %>% 
  select(starts_with("r_outcome")) %>% 
  mutate_all(add_overall_effect) %>% 
  gather(key, mu) %>% 
  # these two lines clean up the text within `key`
  mutate(key = str_remove(key, "r_outcome\\[") %>% str_remove(., ",Intercept\\]")) %>% 
  mutate(key = str_replace_all(key, "[.]", " ")) %>% 
  
  # plot
  ggplot(aes(x = mu, y = reorder(key, mu))) +
  geom_halfeyeh(point_interval = median_qi, .width = .95, size = 2/3) +
  labs(x = expression(italic("Cohen's d")),
       y = NULL) +
  theme(panel.grid   = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y  = element_text(hjust = 0))

You could save the work munging names and lining up the global intercept with the group-level offsets by using tidybayes::spread_draws:

library(tidybayes)
b14.6 %>%
  spread_draws(b_Intercept, r_outcome[outcome,]) %>%
  mutate(mu = b_Intercept + r_outcome) %>%
  ungroup() %>%
  mutate(outcome = str_replace_all(outcome, "[.]", " ")) %>% 

  # plot
  ggplot(aes(x = mu, y = reorder(outcome, mu))) +
  geom_halfeyeh(point_interval = median_qi, .width = .95, size = 2/3) +
  labs(x = expression(italic("Cohen's d")),
       y = NULL) +
  theme(panel.grid   = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y  = element_text(hjust = 0))

Or you could treat it as a problem of generating transformed linear predictors over a prediction grid by using tidybayes::add_fitted_draws along with modelr::data_grid:

library(tidybayes)
library(modelr)
spank %>%
  data_grid(outcome, se = 0) %>%
  add_fitted_draws(b14.6, re_formula = ~ (1|outcome)) %>%

  # plot
  ggplot(aes(x = .value, y = reorder(outcome, .value))) +
  geom_halfeyeh(point_interval = median_qi, .width = .95, size = 2/3) +
  labs(x = expression(italic("Cohen's d")),
       y = NULL) +
  theme(panel.grid   = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y  = element_text(hjust = 0))

Let me know what you think of approaches like that --- I wasn't sure if you had pedagogical reasons for doing it a particular way. If you're amenable to changes like that, I'm happy to file pull requests in that direction (contingent on me having time for things, of course :) ).

Use of mi() when not needed (14.2.1)

Excellent resource you have! I found something which I think is not correct. In 14.2.1 you declare b_model. I think the line that reads:
bf(kcal | mi() ~ 1 + mi(neocortex) + logmass) +

should read
bf(kcal ~ 1 + mi(neocortex) + logmass) +

since kcal has no missingness. I don't think it affects the results, but it can be confusing.

Section 4.4.3.4. Predictions for E{height} at a given weight.

In this code snippet:

mu_at_50 <- 
  post %>% 
  transmute(mu_at_50 = b_Intercept + b_weight_c + 5.01)

Is the formula for mu_at_50 correct?

Or should we compute it as mu_at_50 = b_Intercept + b_weight_c * 5.01?

Models 8.4 and 8.5

I've been unable to figure out the brms equivalent to McElreath’s:

#8.4
m8.4 <- map2stan(
    alist(
        y ~ dnorm( mu , sigma ) ,
        mu <- a1 + a2 ,
        sigma ~ dcauchy( 0 , 1 )
), data=list(y=y) , start=list(a1=0,a2=0,sigma=1) , chains=2 , iter=4000 , warmup=1000 ) 

#8.5
m8.5 <- map2stan(
    alist(
        y ~ dnorm( mu , sigma ) ,
        mu <- a1 + a2 ,
        a1 ~ dnorm( 0 , 10 ) ,
        a2 ~ dnorm( 0 , 10 ) ,
        sigma ~ dcauchy( 0 , 1 )
), data=list(y=y) , start=list(a1=0,a2=0,sigma=1) , chains=2 , iter=4000 , warmup=1000 )

It's the double intercept. It has me stumped.

New book: An Introduction to Bayesian Data Analysis for Cognitive Science

In the Why this? section of the intro, briefly mention the new text by Nicenboim, Schad, and Vasishth, An Introduction to Bayesian Data Analysis for Cognitive Science, which highlights brms in several of the chapters. It might make sense to just do this in a footnote.

Section 4.4.3.5 Using `nesting` when doing posterior calculations

In Section 4.4.3.5.1 Overthinking: Rolling your own predict(), you use nesting(b_Intercept, b_weight_c, sigma) to calculate prediction intervals.

This seems to drop duplicate posterior draws. Is this the correct behavior? Shouldn't we use all draws from the posterior to do inference? That is, if we've already checked that there are no convergence issues.

# There are 4,000 samples from the posterior.
post %>%
  nrow()
#> [1] 4000

# But 3,949 unique samples.
post %>%
  select(b_Intercept, b_weight_c, sigma) %>%
  distinct() %>%
  nrow()
#> Warning: Dropping 'draws_df' class as required metadata was removed.
#> [1] 3949

# By nesting we drop the duplicates.
post %>%
  expand(
    nesting(b_Intercept, b_weight_c, sigma)
  ) %>%
  nrow()
#> [1] 3949

# But perhaps we should keep all the draws?
# Or thin the draws... If duplicates remain (very unlikely), keep them.
post %>%
  crossing(
    weight = 50 - mean(Howell1$weight)
  ) %>%
  nrow()
#> [1] 4000

Size of repo

Just cloned the repository today. Took hours. Size of folder is 6.12 GB (6,576,313,925 bytes). Why? Do I need the _book _bookdown_files folder to build the book?

List chapters in the correct order

The chapters below 10 are missing a leading zero so that the ordering of the chapters is messed up on github.

Simplification of figure 12.2.a

Thanks so much for this project, I found it immensely helpful. I wanted to suggest a more tidyversey way of creating figure 12.2.a:

# Draw 100 samples from the posterior
post <- posterior_samples(b12.2, subset = 1:100) %>%
  transmute(alpha = inv_logit_scaled(b_Intercept),
            sigma = inv_logit_scaled(sd_tank__Intercept))

# draw 1000 random numbers for each posterior sample
map2_dfc(post$alpha, post$sigma, ~rnorm(1e3, .x, .y)) %>% 
  gather %>% 
  ggplot(aes(x = value, group = key)) +
    geom_line(stat = "density", alpha = .2) +
    xlab("log-odds survival")

Hope you find this useful!

Adopt "file" argument of brm

In the current release version of brms (2.5.0) you can now store and recover your fitted model objects automatically via brm. If you like, you could add that when discussing to save model objects in chapter 15.

Model 11.5

I don’t believe brms allows for the beta-binomial likelihood, yet. So we don’t have a way to fit McElreath’s:

m11.5 <- map2stan(
    alist(
        admit ~ dbetabinom(applications,pbar,theta),
        logit(pbar) <- a,
        a ~ dnorm(0,2),
        theta ~ dexp(1)
    ),
    data=d,
    constraints=list(theta="lower=0"),
    start=list(theta=3),
    iter=4000 , warmup=1000 , chains=2 , cores=2 )

Correspondingly, we don’t have a good version of Figure 11.5.

Understanding fitted function in binomial model

Hi! Sorry for the silly question. I've been following this repository while reading the Statistical Rethinking book. I finally reached Ch10 and I've a question regarding the fitted values of a binomial model. I have this (your) code:

library(rethinking)
data(UCBadmit)
d <- UCBadmit
detach(package:rethinking)
library(brms)
rm(UCBadmit)

d <- 
    d %>%
    mutate(male = ifelse(applicant.gender == "male", 1, 0))

m10.6 <- brm(admit | trials(applications) ~ 1 + male, data = d, family = binomial,
             prior = c(set_prior("normal(0, 10)", class = "Intercept"),
                       set_prior("normal(0, 10)", class = "b")),
             iter = 2500, warmup = 500, cores = 2, chains = 2)

which is implementing the following model:

When I ran head(predict(m10.6)), I get the following table, which are the results of the summary statistics from computing rbinom(length_of_chain, number_of_applications, invlogit(evaluation_m10.6_per_chain_value)), i.e. summarising the samples from the first line in the model. Basically, I can replicate this table.

      Estimate Est.Error 2.5%ile 97.5%ile
[1,] 367.45325 16.026807     336      398
[2,]  32.83225  4.899144      24       43
[3,] 249.29800 13.106969     223      275
[4,]   7.56750  2.329657       3       12
[5,] 144.88200  9.530466     126      164
[6,] 180.04175 12.949901     154      205

However, when I ran fitted(predict(m10.6)), the results are the following, but I cannot understand how these estimated errors and percentiles are computed.

       Estimate Est.Error    2.5%ile   97.5%ile
[1,] 367.289295 7.9252189 352.243511 382.861665
[2,]  32.741944 1.1580285  30.418559  34.980130
[3,] 249.311521 5.3795425 239.098625 259.881858
[4,]   7.579154 0.2680622   7.041333   8.097252
[5,] 144.689722 3.1220559 138.762595 150.824292
[6,] 179.777528 6.3584342 167.020420 192.066827

Any explanation on this is highly appreciated.

me() to be depreciated

Add a link to this comment by Paul in the Stan Forums about the upcoming depreciation of the me() syntax: https://discourse.mc-stan.org/t/measurement-error-corrected-hierarchical-model-in-brms/8427/4.

mi() in chapter 14

See chapter 15 in the sister project for the 2nd edition for a corrected workflow for Section 14.1. In short, use more broom:tidy() and less fitted() when plotting the posteriors for the States.

These edits are now present in the https://github.com/ASKurz/Statistical_Rethinking_with_brms_ggplot2_and_the_tidyverse/blob/master/in_the_works/14.1--14.2_workflow_scrub.Rmd file. Just integrate this into the primary 14.Rmd file. Also, make sure to save the new b14.2_mi_mi fit in the fits folder.

Book takes several minutes to allow interaction

With my current version of Google Chrome, on two Windows 10 PCs the book takes a long time to load, and Chrome offers to kill the page because it believes it has crashed. If you wait long enough it will allow you to interact, but this can take over 2 minutes every time you load the page (https://bookdown.org/content/4857/).

I'm happy to provide further details and to assist in testing / debugging.