bayesian algorithm
Ecological Fallacy
1. Write a generative simulation involving a predictor variable $X$, an outcome variable $Y$, and a grouping variable $G \in \{1, 2\}$. The simulation should randomly generate a total of $N = 500$ observations and satisfy the following constraints:
• Approximately half of the observations fall in each group, $N_1, N_2 \approx 250$.
• $X_G$ is normally distributed with group means of approximately 100 (group 1) and 80 (group 2) and a standard deviation of 15 in each group.
• $Y_{G,i}$ is normally distributed with a standard deviation of 10 and group means $\mu_{G,i} = 50 + \beta_G X_i$.
• The effects of $X$ on $Y$ are $\beta_1 = 0$ and $\beta_2 = -0.5$.
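A minimal sketch of one way to implement such a simulation is shown here. The choice of Python with only the standard library, the function name `simulate`, and the fixed seed are assumptions for illustration; the assignment does not prescribe them. Group membership is drawn with probability 0.5, so the group sizes are only approximately 250.

```python
import random
import statistics

def simulate(n=500, seed=1):
    """Generate n observations (g, x, y) following the stated constraints."""
    rng = random.Random(seed)          # illustrative seed, for reproducibility only
    x_mean = {1: 100.0, 2: 80.0}       # group means of X
    beta = {1: 0.0, 2: -0.5}           # group-specific effects of X on Y
    rows = []
    for _ in range(n):
        g = rng.choice((1, 2))             # each group with probability 0.5
        x = rng.gauss(x_mean[g], 15.0)     # X_G ~ Normal(group mean, 15)
        mu = 50.0 + beta[g] * x            # mu_{G,i} = 50 + beta_G * X_i
        y = rng.gauss(mu, 10.0)            # Y_{G,i} ~ Normal(mu_{G,i}, 10)
        rows.append((g, x, y))
    return rows

data = simulate()
for g in (1, 2):
    xs = [x for gg, x, _ in data if gg == g]
    ys = [y for gg, _, y in data if gg == g]
    # Group label, group size, mean of X, mean of Y
    print(g, len(xs), round(statistics.mean(xs), 1), round(statistics.mean(ys), 1))
```

Because $\beta_2 = -0.5$ and the group 2 mean of $X$ is about 80, the mean of $Y$ in group 2 should come out near 10, while group 1 stays near 50. This gap between the groups is what drives the ecological fallacy in the pooled data.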
2. Create one or more visualizations of the relation between $X$ and $Y$ that illustrate the group differences in $X$, $Y$, and $\beta$, as well as the ecological fallacy that would result from ignoring the group differences.
3. Analyze the simulated data using a Bayesian linear model predicting $Y$ from $X$. Use quadratic approximation or MCMC. Use informative priors that don't match the parameters from the generative simulation, e.g., place a larger proportion of prior mass on parameter values you know to be false. (Intentionally setting "bad" priors here illustrates how the models learn from the data.)
• Model 1: Run a model that doesn't account for group differences to statistically corroborate the ecological fallacy shown in the previous visualizations.
• Model 2: Run a model that does account for group differences and captures the parameter settings from the generative simulation. How many priors are required to estimate the model?
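The contrast between the two models can be previewed with a closed-form calculation. The following sketch is not quadratic approximation or MCMC: it swaps in the conjugate normal posterior that holds when the residual standard deviation is treated as known (fixed at 10 here, which is an assumption), so no prior on it appears. The helper names `solve` and `posterior_mean`, the compact re-simulation, and the deliberately misplaced priors (slopes centered on 2, intercept centered on 0) are all illustrative choices.

```python
import random

def simulate(n=500, seed=1):
    """Compact version of the generative simulation from Task 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        g = rng.choice((1, 2))
        x = rng.gauss(100.0 if g == 1 else 80.0, 15.0)
        y = rng.gauss(50.0 + (0.0 if g == 1 else -0.5) * x, 10.0)
        rows.append((g, x, y))
    return rows

def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [v - f * w for v, w in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def posterior_mean(design, y, prior_mean, prior_sd, sigma=10.0):
    """Posterior mean of the coefficients under independent normal priors
    and a known residual standard deviation sigma (an assumption)."""
    k = len(prior_mean)
    # Posterior precision = prior precision + X'X / sigma^2
    prec = [[sum(r[i] * r[j] for r in design) / sigma**2 for j in range(k)]
            for i in range(k)]
    # Right-hand side = prior precision * prior mean + X'y / sigma^2
    rhs = [sum(r[i] * yi for r, yi in zip(design, y)) / sigma**2 for i in range(k)]
    for i in range(k):
        prec[i][i] += 1.0 / prior_sd[i] ** 2
        rhs[i] += prior_mean[i] / prior_sd[i] ** 2
    return solve(prec, rhs)

data = simulate()
y = [yi for _, _, yi in data]

# Model 1: intercept plus a single pooled slope (ignores G)
design1 = [[1.0, x] for _, x, _ in data]
m1 = posterior_mean(design1, y, prior_mean=[0.0, 2.0], prior_sd=[20.0, 1.0])

# Model 2: shared intercept plus one slope per group
design2 = [[1.0, x if g == 1 else 0.0, x if g == 2 else 0.0] for g, x, _ in data]
m2 = posterior_mean(design2, y, prior_mean=[0.0, 2.0, 2.0], prior_sd=[20.0, 1.0, 1.0])

print("Model 1 (intercept, slope):", [round(v, 2) for v in m1])
print("Model 2 (intercept, slope group 1, slope group 2):", [round(v, 2) for v in m2])
```

With 500 observations the data dominate the misplaced priors: the pooled slope of Model 1 should come out clearly positive, even though neither group has a positive slope, while the group-specific slopes of Model 2 should land near 0 and -0.5. A quadratic-approximation or MCMC fit would additionally estimate the residual standard deviation rather than fixing it.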
For Tasks 4 and 5, continue with Model 2.
4. Compute and plot the posterior distribution of the difference between the group-specific slope parameters using 1,000 samples from the joint posterior distribution.
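As a short worked formulation, the difference can be computed draw by draw from the joint posterior samples. The symbol $\delta$ and the draw index $s$ are notation introduced here for illustration, not part of the assignment.

```latex
\begin{align*}
\delta^{(s)} &= \beta_1^{(s)} - \beta_2^{(s)}, \qquad s = 1, \dots, 1000 \\
\operatorname{Var}(\delta) &= \operatorname{Var}(\beta_1) + \operatorname{Var}(\beta_2) - 2\,\operatorname{Cov}(\beta_1, \beta_2)
\end{align*}
```

The second line shows why the samples should come from the joint posterior: any posterior covariance between the two slopes changes the spread of $\delta$, and subtracting within each draw accounts for it automatically.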
5. Conduct a posterior predictive check.
• For each $X$ value simulated in Task 1, predict a $Y$ value using the posterior distributions from Model 2.
• Create one or more visualizations of the posterior predictions illustrating the ability of the model to capture and predict the association of $X$ and $Y$.
Tip: To predict values for $N$ people, you should draw $N$ samples from the posterior distribution. Also consider possible differences in group size.
More Groups in Data
Read in the data set WorldData.csv. For the remaining tasks, you will focus on the relations between life expectancy, freedom of choice, and the regional data (region/region_index/country). Briefly familiarize yourself with the variables using visualizations and descriptive statistics. Tip: All numerical variables except log_gdp are standardized.
The goal is to study the association between freedom of choice and life expectancy. There are many possible and rather complex causal routes linking the two measures, involving the influence of a large set of confounds. A proper statistical model would require a lot more thought than can be asked for in this assignment. In the following, we only stratify by region, assuming that some of the more impactful confounds are embedded therein, and hope for the best.
6. Compute the frequency distribution for the variable region. Briefly discuss potential problems we could run into when stratifying by region, that is, when computing a separate regression model for each region.
7. Analyze the data while stratifying by region using quadratic approximation or MCMC.
• Model 3: Estimate a Bayesian Gaussian model for the variable life expectancy.
• Model 4: Estimate a Bayesian linear model predicting life expectancy from freedom of choice.
8. Below, you find the code and output of a multilevel model for the association between freedom of choice and life expectancy. How and why do the estimates differ from those of the previous Model 4?