
Comments (7)

bob-carpenter commented on August 20, 2024


mattansb commented on August 20, 2024

Hi Bob,

Thanks for your comment!

I generally agree with your comment - MAPs have been shown to be poor estimates (but their appeal as the "most probable value" and their close match with MLEs is just too tempting, I guess).

The "penalization" placed by the prior is not "good" for error rates, but also for estimation. One failure of freq methods is its lack of incorporation of priors (be they from previous data collection, or from subjective expectations), making all decisions and estimations based on the observed data alone.

We really should add a comment section (Disqus for blogdown?)...


strengejacke commented on August 20, 2024

Bob,

thank you for your comment! I'm not sure if you have noticed our accompanying paper related to the post, Indices of Effect Existence and Significance in the Bayesian Framework (there was a kind of "mutual inspiration" between the development of the bayestestR package and the writing of the paper, hence the relationship...).

> Deriving a point estimate isn't particularly Bayesian, but at least the posterior mean and median have natural probabilistic interpretations as an expectation and the point at which a 50% probability obtains. With those estimators, results will vary from MAP based on how skewed the distribution is.

I think what is important, from our perspective, not only in Bayesian inference but also in a frequentist framework, is to take the uncertainty of estimation into account (or even focus on it). Thus we suggest using at least two indices of effect "existence" and "significance" (not necessarily in the sense of statistical significance). The pd is one of these possible indices that works well, in particular due to its 1:1 correspondence to the p-value, which makes it easier to make Bayesian methods attractive to "non-Bayesians". I guess this blog post couldn't convey this bigger picture of "focus on uncertainty"...
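(As a quick, purely illustrative aside on the skewness point quoted above — a minimal sketch with simulated draws; the gamma stand-in for a posterior is an arbitrary choice:)

```r
# Minimal sketch: how mean, median, and MAP diverge for a skewed posterior.
# Draws simulated from a gamma(2, 4) stand-in for a posterior; values arbitrary.
set.seed(3)
draws <- rgamma(10000, shape = 2, rate = 4)

post_mean   <- mean(draws)                # expectation, analytically 0.5
post_median <- median(draws)              # 50% point, roughly 0.42
dens        <- density(draws)
post_map    <- dens$x[which.max(dens$y)]  # kernel-density mode, roughly 0.25
c(mean = post_mean, median = post_median, MAP = post_map)
```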

> This leads to underestimates of lower-level regression coefficient uncertainty by construction, as you see in packages like lme4 in R.

You mean shrinkage? But isn't that a desired property of mixed models, to avoid overly "strong" effects (estimates)?

> When we want to do downstream predictive inference, we don't want to just plug in an estimate, we want to do posterior predictive inference and average over our uncertainty in parameter estimation.

We had a blog post some months ago, to which Aki responded (via email). He wrote about predictive model checking, but though I have read some papers by Aki and you (only cursorily), I did not fully understand how best to use posterior predictive inference (I hope I understood you correctly here).

> Part of the point of Bayesian inference is to not have to collapse to a point estimate.

True, but I think this should apply to the frequentist framework as well.

> A prior defined for a probability variable in [0, 1] that's flat is very different from a flat prior for a log odds variable in (-infinity, infinity).

That's a good point, in particular since we are working on a paper on how to use informative priors in rstanarm, and how different priors affect the model results.

> but it's going to have the "wrong" effect on this notion of p-direction unless you talk about p-direction of the difference from the population estimate, rather than the random effect itself

I think I see what you mean. So it would make sense to "warn" users of p_direction() that, if applied to random effects, the result indicates the probability of being above/below the "group" average (be the groups states or whatever)...
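(To make that warning concrete — a minimal sketch with hypothetical posterior draws of one group's deviation; nothing here uses bayestestR itself:)

```r
# Minimal sketch: pd applied to a random effect answers "what is the
# probability this group lies above/below the population average?",
# not "...that its absolute effect is nonzero?". Hypothetical draws.
set.seed(7)
u_state <- rnorm(4000, mean = 0.15, sd = 0.20)  # one state's deviation

pd_state <- max(mean(u_state > 0), mean(u_state < 0))
pd_state  # probability the state sits above the population average
```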


bob-carpenter commented on August 20, 2024

> I guess this blog post couldn't convey this bigger picture of "focus on uncertainty"...

That's something I'm 100% behind :-)

> One failure of frequentist methods is their lack of incorporation of priors (...), making all decisions and estimations based on the observed data alone.

This isn't true. While a strict philosophical frequentist won't let you talk about a distribution over a parameter (so no prior or posterior distributions), they're totally OK with penalized maximum likelihood. That is, they're often OK with introducing some bias for reduced variance. Given parameters theta, data y, and likelihood p(y | theta), the maximum likelihood estimate is

theta* = ARGMAX_{theta} log p(y | theta).

A penalized maximum likelihood estimate just adds a penalty term. For instance, the equivalent of a normal prior would be an L2 penalty, e.g.,

theta* = ARGMAX_theta log p(y | theta) - 0.5 * lambda * SUM_{n in 1:N} theta[n]^2.

The resulting penalized MLE is equivalent to the MAP estimate of a Bayesian model in which theta is assigned a normal prior with precision lambda (i.e., normal(0, 1 / sqrt(lambda))).
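A minimal runnable sketch of that equivalence (simulated data; lambda = 1 is an arbitrary choice):

```r
# Minimal sketch: penalized MLE via optim() for a logistic regression.
# Simulated data; lambda is the L2 penalty weight (the prior precision).
set.seed(1)
N <- 200
x <- rnorm(N)
y <- rbinom(N, 1, plogis(0.5 + 1.2 * x))
X <- cbind(1, x)
lambda <- 1

# negative penalized log likelihood:
# -log p(y | theta) + 0.5 * lambda * sum(theta^2)
neg_pen_loglik <- function(theta) {
  eta <- drop(X %*% theta)
  -sum(dbinom(y, 1, plogis(eta), log = TRUE)) + 0.5 * lambda * sum(theta^2)
}

fit <- optim(c(0, 0), neg_pen_loglik, method = "BFGS")
fit$par  # equals the MAP under a normal(0, 1 / sqrt(lambda)) prior on theta
```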

Here's a link to the classic Efron and Morris (1975) paper about Stein's estimator. They go much further and build a hierarchical model, to which they apply max marginal likelihood by marginalizing out the lower-level coefficients and optimizing the hierarchical regularization parameters. This is the same as modern systems like lme4 (in R). Although that technique is often called "empirical Bayes", it's a point estimation technique based on optimization and totally kosher among frequentists.

> The pd is one of these possible indices that works well, in particular due to its 1:1 correspondence to the p-value

It kind of looks like a p-value, but it doesn't act like a p-value, so I fear this is just going to confuse people. Worse yet, I fear it may encourage them to think in terms of a binary significant/not-significant decision, which is what we want to leave behind with the frequentists. Instead, we want to do something like decision analysis (e.g., as described in Bayesian Data Analysis) to make decisions, which considers not only posterior uncertainty, but magnitude and its effect on things we care about like the utility of decisions.

> You mean shrinkage? But isn't that a desired property of mixed models, to avoid overly "strong" effects (estimates)?

No, I mean regularization toward population means. For example, see the baseball ability estimates from the Efron and Morris paper I linked above. The idea is that you only have a few dozen observations per player, so you want to shrink not to zero, but to the population average. Then if you see no data, your guess is the population average, not zero.

The point of mixed models is to figure out how much pooling to do between no pooling (infinite hierarchical variance) on one side to complete pooling (hierarchical variance zero) on the other. I discuss this in my Stan case study that recasts Efron and Morris in Bayesian terms.
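A minimal numeric sketch of the pooling idea, with made-up numbers in the spirit of the batting-average example:

```r
# Minimal sketch: precision-weighted shrinkage toward the population mean.
# All numbers are made up, in the spirit of the batting-average example.
y_j     <- 0.400  # one player's observed average (few at-bats)
sigma_j <- 0.070  # standard error of that observation
mu      <- 0.265  # population (league-wide) average
tau     <- 0.030  # between-player standard deviation

# shrinks toward mu, not toward zero; with no data (sigma_j -> Inf) the
# estimate is exactly mu
theta_j <- (y_j / sigma_j^2 + mu / tau^2) / (1 / sigma_j^2 + 1 / tau^2)
theta_j  # about 0.286: heavy shrinkage because sigma_j >> tau
```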

> to which Aki responded (via email)

Glad to hear I'm not the only person obsessing about definitions on the web.

> I did not fully understand how best to use posterior predictive inference (I hope I understood you correctly here).

The idea is that you average over your posterior, so that's

p(y' | y) = INTEGRAL p(y' | theta) * p(theta | y) d.theta

which in MCMC terms is just

p(y' | y) =approx= 1/M SUM_{m in 1:M} p(y' | theta(m))

where theta(m) is a draw from the posterior p(theta | y). It's not out yet, but I have a PR that adds new chapters to the Stan user's guide to try to explain this in simpler terms than sources like Bayesian Data Analysis (of which Aki is a co-author). I also have a simple case study where I compare plugging in the MAP estimate, the posterior mean estimate, or doing full Bayesian posterior predictive inference for a simple logistic regression; posterior predictive inference dominates the other methods when so-called "proper scoring metrics" are used (like log loss or squared error).
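In code, the MCMC approximation is just a mean over draws. A minimal sketch for a logistic regression, where the draws matrix is assumed to come from some already-fitted model:

```r
# Minimal sketch: posterior predictive probability by averaging over draws,
# versus plugging in a point estimate. `draws` is assumed to be an M x 2
# matrix of posterior draws of (intercept, slope) from some fitted model.
posterior_pred_prob <- function(draws, x_new) {
  eta <- draws[, 1] + draws[, 2] * x_new  # linear predictor, one per draw
  mean(plogis(eta))                       # (1/M) * sum_m Pr(y' = 1 | theta(m))
}

plugin_pred_prob <- function(draws, x_new) {
  theta_bar <- colMeans(draws)                 # collapse to the posterior mean...
  plogis(theta_bar[1] + theta_bar[2] * x_new)  # ...then plug it in
}
```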

> > Part of the point of Bayesian inference is to not have to collapse to a point estimate.
>
> True, but I think this should apply to the frequentist framework as well.

How so?

P.S. You might want to compare notes with Rasmus Bååth's work on his Bayesian First Aid package, which translates many classical notions like hypothesis tests into Bayesian frameworks.


mattansb commented on August 20, 2024

Thanks, Bob, for all the explanations! It wasn't clear to me until now how hierarchical shrinkage actually works!

> It kind of looks like a p-value, but it doesn't act like a p-value

Can you elaborate on this? To me, they hold the same amount of information (for p-values in NHST, at least), and they behave the same in the sense that both are consistent under H1 (the p-value gets smaller, and pd gets bigger) but not under H0 (the p-value is uniform between 0 and 1; pd is uniform between 0.5 and 1).
(I totally agree that it might encourage dichotomous thinking, but as @strengejacke said, we recommend not reporting only the pd).
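(For what it's worth, the correspondence is easy to check numerically — a minimal sketch with simulated draws, assuming the flat-prior, symmetric setting where it holds:)

```r
# Minimal sketch: with a flat prior and a symmetric posterior, a two-sided
# p-value corresponds to 2 * (1 - pd). Simulated draws; values arbitrary.
set.seed(42)
draws <- rnorm(4000, mean = 0.3, sd = 0.2)  # posterior draws of a coefficient

pd <- max(mean(draws > 0), mean(draws < 0))  # probability of direction
p_two_sided <- 2 * (1 - pd)                  # the "1:1 correspondence"
c(pd = pd, p = p_two_sided)
```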


bob-carpenter commented on August 20, 2024


mattansb commented on August 20, 2024

Right, computation and interpretation are (and should be) different (likelihood vs. posterior probability), but it is still hard to ignore the basically 1:1 correspondence between the values, and the other properties/behaviors shared by the p-value and pd (for the lay user, at the very least, and I think that is who we were aiming for here)...

Thanks for engaging! Definitely got a lot to mull over!

