betanalpha / knitr_case_studies Goto Github PK
View Code? Open in Web Editor NEWInference case studies in knitr
Inference case studies in knitr
I noticed what may be a minor typo in your excellent Gaussian Processes tutorial. I believe that line 1342 (at commit 8db2461) of knitr_case_studies/gaussian_processes/gaussian_processes.Rmd
should be
N_obs=data$N_obs, x_obs=data$x_obs, y_obs=data$y_obs,
currently it is
N_obs=data$N_obs, x=data$x_obs, y_obs=data$y_obs,
I just tried to compile the principled bayesian workflow case study, but this fails with a stan syntax error.
In sample_joint_ensemble4.stan
you declare an array y[N]
but try to assing to y_ppc[n]
in the loop. Changig y_ppc
to y
would be consistent with the other sample*.stan
files.
The case study "Towards A Principled Bayesian Workflow" presents two possible definitions of SBC ranks:
and
with
Similarly if the estimated posterior is biased low then the prior ranks will be biased high and the SBC histogram will concentrate at the right boundary.
For the second definition, you say
then a low bias in the posterior fit will manifest in a SBC histogram that concentrates at the left boundary.
My question is if these two interpretations shouldn't be inversed.
\mathcal{} symbols like chi-square are not rendered properly in html. They appear as an empty rectangles in a web browser.
I am not sure if I got everything from the 'Underdetermined linear regression' case study correctly, but line 50 of linear_regression_w_proj.stan is
gamma_par = append_row(beta, alpha) - gamma_perp;
though I would have thought it should be
gamma_par = (append_row(beta, alpha) - R) - gamma_perp;
Am I getting something wrong or is this a typo?
Hi Michael,
Thanks for these write-ups; they are fantastic. In "Identifying Bayesian Mixture Models" I think the sum in the first equation of section 4 should be a product, to properly define the likelihood.
Best,
Connor
Thanks for the great article on the Bayesian workflow. This is just a minor comment on the terminology of censoring and truncation in the given example case.
The text says:
"When the student repeated the measurement for detectors in this state they unintentionally induced a censored obsevation where the Poisson distribution for the observations is cutoff at a certain value."
But based on the terminology that I've learned from survival analysis, this would actually be truncation rather than censoring, since the observations that went over the scale of the sensor are not recorded at all in the data (they would be censored, if the student had recorded each case where the measurement went over the scale). This terminology seems also to follow the one in Bayesian Data Analysis book (2nd edition), Chapter 7.8.
(There's also a typo "obsevation" in the quoted sentence above.)
Thank you for all your excellent case studies!
In the right plot of figure multi.tex in the Ordinal Regression case study, some of the cut point indices are wrong.
I think it should be:
Since the prior is a half-normal (i.e., truncated at 0), shouldn't we use qnorm(0.995)
instead of qnorm(0.99)
to get the prior standard deviation?
TruncatedNormal::qtnorm(0.99, 0, 15 / qnorm(.995), 0, Inf)
#> [1] 15
TruncatedNormal::qtnorm(0.99, 0, 15 / qnorm(.99), 0, Inf)
#> [1] 16.60863
# where
15 / qnorm(.995)
#> [1] 5.823367
15 / qnorm(.99)
#> [1] 6.447875
Created on 2019-11-18 by the reprex package (v0.3.0)
Hi Michael, thanks for writing up all the tutorials. I wasn't sure how you'd like feedback/typo catching, so I'll just post here. In the doc probability theory, the complement of B1 should be in the set of Reals.
Best,
Scott
Conditional not onditional in the following
The paragraph talks about f: X -> R, and then talks about E(g). Recommend making it E(f)
There are 2 places in principled_bayesian_workflow.Rmd
where the expression
for the sd of an inverse gamma is used viz:
prior_sd_lambda <- sqrt( (9.21604)**2 / ((3.48681 - 1)**2 * (3.48681 - 1)) )
I think this should be (note the -2):
prior_sd_lambda <- sqrt( (9.21604)**2 / ((3.48681 - 1)**2 * (3.48681 - 2)) )
BTW: awesome work!
Thank you @betanalpha for this informative and hands on tutorial on the workflow!
I am confused and had been thinking for days on the sbc_rank of prior draw with respect to thinned posterior draws. the distribution of these rankings are supposed to be uniform, and for good models the result seems to be uniform.
My question, though, is why isn't it more concentrated? the code seems to calculate the rank of the true parameter that were used to generate the date against the fitted draws from posterior. If the posterior is well-fitted, wouldn't we expect to recover this parameter?
of course the choice of prior matters, in this case it means the posterior would have a tendency to go for smaller values.
then based on the code, I would expect the mode of the rank to be smaller than 250 (4000/8/2), yet the result was uniform.
Is there any resources that I can read to clear my head?
thanks
Hi Mike,
I have a couple of questions about the stan_utility
script:
stan_utility.R
's latest version is in the stan_intro
folder, but would it help to have a pull request that updates it for all the other folders?check_div
, check_treedepth
, and check_energy
fully substitutable with rstan's own functions, with the caveat that it is rstan::check_divergence
and not rstan::check_div
?check_n_eff
and check_rhat
aren't in the package itself---or did I miss something?My Stan is a bit rusty at the moment.
I'm not sure whether I am interpreting the output message correctly, for the same stan fit object I get:
check_treedepth(tabi_res$fit, max_depth = 10)
[1] "6 of 1000 iterations saturated the maximum tree depth of 10 (0.6%)"
[1] " Run again with max_depth set to a larger value to avoid saturation"
> check_treedepth(tabi_res$fit, max_depth = 12)
[1] "988 of 1000 iterations saturated the maximum tree depth of 12 (98.8%)"
[1] " Run again with max_depth set to a larger value to avoid saturation"
Shouldn't the first check include the interactions of the second check?
The stan_utility.R functions are great for model checking. If anyone wants to use them with models fit using the 'rstanarm' package, the following modified functions can be used.
# copied from
# https://github.com/betanalpha/knitr_case_studies/blob/master/rstan_workflow/stan_utility.R
# modified to work with models fit using rstanarm
# Tidybayes to extract model parameters and stuff
library(tidybayes) # Tested on tidybayes version 1.0.3
library(tidyverse)
library(rstanarm)
# Check transitions that ended with a divergence
check_div <- function(fit, quiet=FALSE) {
sampler_params <- fit %>% tidy_draws()
divergent <- sampler_params[['divergent__']]
n = sum(divergent)
N = length(divergent)
if (!quiet) print(sprintf('%s of %s iterations ended with a divergence (%s%%)',
n, N, 100 * n / N))
if (n > 0) {
if (!quiet) print(' Try running with larger adapt_delta to remove the divergences')
if (quiet) return(FALSE)
} else {
if (quiet) return(TRUE)
}
}
# Check transitions that ended prematurely due to maximum tree depth limit
check_treedepth <- function(fit, max_depth = 15, quiet=FALSE) {
sampler_params <- fit %>% tidy_draws()
treedepths <- sampler_params[['treedepth__']]
print(paste('Max treedepths reached is',max(treedepths)))
n = length(treedepths[sapply(treedepths, function(x) x == max_depth)])
N = length(treedepths)
if (!quiet)
print(sprintf('%s of %s iterations saturated the maximum tree depth of %s (%s%%)',
n, N, max_depth, 100 * n / N))
if (n > 0) {
if (!quiet) print(' Run again with max_treedepth set to a larger value to avoid saturation')
if (quiet) return(FALSE)
} else {
if (quiet) return(TRUE)
}
}
# Checks the energy fraction of missing information (E-FMI)
check_energy <- function(fit, quiet=FALSE) {
sampler_params <- fit %>% tidy_draws()
no_warning <- TRUE
energies = sampler_params[['energy__']]
numer = sum(diff(energies)**2) / length(energies)
denom = var(energies)
print(paste('Energy is ',numer / denom))
if (numer / denom > 0.2) {
if (!quiet) print('E-BFMI indicated no pathological behavior')
if (quiet) return(TRUE)
} else {
if (!quiet) print(' E-BFMI below 0.2 indicates you may need to reparameterize your model')
if (quiet) return(FALSE)
}
}
# Checks the effective sample size per iteration
check_n_eff <- function(fit, quiet=FALSE) {
fit_summary <- fit$stan_summary
N <- dim(fit_summary)[[1]]
iter <- prod(dim(fit$stanfit)[1:2])
no_warning <- TRUE
ratvals = c()
for (n in 1:N) {
ratio <- fit_summary[n,'n_eff'] / iter
ratvals[n] = ratio
if (ratio < 0.001) {
if (!quiet) print(sprintf('n_eff / iter for parameter %s is %s!',
rownames(fit_summary)[n], ratio))
no_warning <- FALSE
}
}
print(paste('lowest n_eff is ',min(ratvals)))
if (no_warning) {
if (!quiet) print('n_eff / iter looks reasonable for all parameters')
if (quiet) return(TRUE)
}
else {
if (!quiet) print(' n_eff / iter below 0.001 indicates that the effective sample size has likely been overestimated')
if (quiet) return(FALSE)
}
}
# Checks the potential scale reduction factors
check_rhat <- function(fit, quiet=FALSE) {
fit_summary <- fit$stan_summary
N <- dim(fit_summary)[[1]]
ratvals = c()
no_warning <- TRUE
for (n in 1:N) {
rhat <- fit_summary[n,'Rhat']
ratvals[n] = rhat
if (rhat > 1.1 || is.infinite(rhat) || is.nan(rhat)) {
if (!quiet) print(sprintf('Rhat for parameter %s is %s!',
rownames(fit_summary)[n], rhat))
no_warning <- FALSE
}
}
print(paste('highest rhat is ',max(ratvals)))
if (no_warning) {
if (!quiet) print('Rhat looks reasonable for all parameters')
if (quiet) return(TRUE)
} else {
if (!quiet) print(' Rhat above 1.1 indicates that the chains very likely have not mixed')
if (quiet) return(FALSE)
}
}
Hello, Michael
I was reading your case study on Probability Theory Concepts titled "Probability Theory (For Scientists and Engineers)" and I think I found a typo.
On the 4th paragraph of Section 2.1 Probability Distributions, you write:
Moreover, any distribution of probability should _globally consistent_ over any
disjoint decomposition of the total space.
I believe there is a missing "be" between "should" and "globally".
I know this is minor and won't affect the comprehension of your reader, yet, your work is so thorough and top-notch that I thought you'd like to know about it so you could make the correction should you judge it necessary.
P.S.: I'd also like to take this opportunity to thank you for your generosity. It's not every day that we find sterling work like yours available as free online tutorials. It's of prime quality and deeply insightful. Your clear written explanation aided by sharp visualizations helps all of us, non-Mathematicians, to have a firmer grip on vital concepts that sometimes are poorly covered at college-level courses.
Would you mind taking a minute to look at the last two equations of subsection 3.4 in Probability Theory on Product Spaces? There are two issues that I'm struggling to wrap my head around.
P(A | B) = P(B|A) P(A) / P(B),
but it appears that you have the same conditioning happening on the left hand side as on the right hand side.
\pi(x_2 | \tilde{x}_1) = \frac{ \pi(x_2 | \tilde{x}_1) }{ constant } \pi(x_2)
Given Section 1.3, it seems that \pi(\tilde{x}_1 | x_2) should rightfully be function defined on x_1 \cross X_2. Is this interpretation correct?
Thanks for these great case studies. I really appreciate the clarity of mathematical concepts that you offer, especially when it comes to random variables.
First, thank you for providing these studies. I have been looking for probability theory intro material to provide to my data science students & advisees, and I think your studies will be quite useful for their intermediate-level study of probability.
There are two things that tripped me up, though.
The first is the pushforward map definition. Since
The second is the definition of absolutely continuous in 4.3. The statement:
A measure
$\nu$ is absolutely continuous with respect to another measure$\mu$ when$\nu$ allocates zero volume only to those sets which$\mu$ also allocates zero volume
My translation of this, consistent with the 'only if' in the following formula, would be that
A measure
$\nu$ is absolutely continuous with respect to another measure$\mu$ when$\nu$ allocates zero volume to all sets to which$\mu$ also allocates zero volume
It's quite possible I'm missing something, as I am pretty new to measure theory, but I don't see how these definitions don't contradict.
Awesome resource Michael. However, I think there's a conflict between your description of the non-centered implementation of the Eight Schools model and the actual Stan code. The code states that "theta[j] = mu + tau * theta_tilde[j]" whereas the written description of the model has sigma in place of tau. Figured it would be worth the clarification for those (like myself) who are just getting familiar with coding non-centered versions of models.
Hi there - I think I've spotted some typos in Section 4.3.4 of probability on product spaces, specifically in the equations in the figures:
dgm_condition_1.png
- the first line should be condtional on \tilde{x}_1 rather than \tilde{x}_4dgm_condition_2.png
- on the last line, in the final two factors, x_4 should be replaced by x_1Also: thanks for all the great resources!
In the case study "Towards A Principled Bayesian Workflow", I think there's a typo in the definition of the posterior z-score (section 1.3.2 "A Bayesian Eye Chart"): Shouldn't the denominator be the marginal posterior standard deviation instead of the marginal posterior mean (see also here)?
Hi!
I think the Metropolis-Hastings conditional PDF in markov_chain_monte_carlo.Rmd
should be:
$$
T(q' \mid q) = a(q', q) \cdot Q (q' \mid q )
+ \left(1 - \int a(q, q') Q(q' \mid q) d q' \right) \cdot \delta(q - q').
$$
instead of
$$
T(q' \mid q) = a(q', q) \cdot Q (q' \mid q )
+ \left(1 - a(q, q') \right) \cdot \delta(q - q').
$$
Else, the conditional PDF T(q' | q) doesn't integrate to 1. See, for instance, pg 3 of this note.
Great work by the way! I found the bit on correctly estimating the maximum lag for the autocorrelation very useful.
The figures show:
P(A_5)+P(A_6) = P (A_1) = 1
(and similarly the one next to it). That does not look correct as P(A_1) + P (A_2) = P(A_0) = 1. It should be:
P(A_5) + P(A_6) = P(A_1) (but not equal to 1).
Your gaussian processes 3 write-up states that the priors for the hyper parameters are half normal,
e.g.
$$\alpha \sim $ half-
and the corresponding code is
parameters{ ... real<lower=0> alpha; } model{ alpha ~ normal(0,2); ... }
But according to the stan reference manual (v 2.16.0) pg. 400: "If a variable X is declared to have a lower bound
which in this case would just be the log, suggesting that the new random variable is log normal, not half - would you mind clearing up this discrepancy for me?
I'm using a similar parameterization in a model I'm working with and would like to know the "real" prior distribution I'm using. - Many Thanks.
I am a bit confused why the plots for Predictive Realizations and Posterior Predictive Realizations (in Part 1 of Gaussian Processes) are identical.
Could it be that plot_gp_realizations
was used, mistakenly, instead of plot_gp_pred_realizations
to plot the Posterior Predictive Realizations?
I like your ordinal regression case study, and the approach to setting priors on the cutpoints is relevant to some models I am working on. I think that the determinant of that Jacobian matrix does have an analytic determinant, though. I have not written up a full proof, but I am convinced that the determinant is the product of the diagonal times the matrix dimension. In R, it would be something like
prod(diag(J)) * nrow(J)
Here is some code for testing this expression on matrices of the specified structure, with random entries:
nrep <- 100
## dimension of matrix:
nr <- sample(3:20, 100, replace=TRUE)
res <- matrix(NA, nrep, 2)
for(i in 1:nrep){
## matrix entries:
ents <- rnorm((nr[i] - 1), sd=2)
tmpmat <- matrix(0, nr[i], nr[i])
tmpmat[,1] <- 1
for(j in 1:(nr[i] - 1)){
tmpmat[j:(j+1), (j+1)] <- c(ents[j], -ents[j])
}
res[i,1] <- det(tmpmat)
res[i,2] <- prod(diag(tmpmat)) * nr[i]
}
all.equal(res[,1], res[,2])
In the case study "Towards A Principled Bayesian Workflow", I think the sentence
On the other hand a posterior distribution with both a small posterior z-score and a small posterior contraction indicates ideal learning behavior.
has to read
"On the other hand a posterior distribution with both a small posterior z-score and a high posterior contraction indicates ideal learning behavior."
Otherwise, it would also contradict with the sentence
Likewise a posterior distribution with a small posterior z-score and a small posterior contraction is only poorly informed by the observations.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.