title | author | date | output | ||||||
---|---|---|---|---|---|---|---|---|---|
Banded or faded confidence intervals |
David Hood |
3/10/2018 |
|
I don't like confidence intervals as normally presented (though I acknowledge it is often for journal style/economics reasons outside of the author's control), so here is two alternatives: banded and fade-out, for when you have the freedom to do better justice to the data.
I initially began doing these with a mass of custom code in base plot. This is a ggplot2 version, involving a bit less code and a few more libraries. This is very much an "intermediate" R walkthrough of what I do, in as much as I am not explaining ggplot or data transformation, just showing how I do one particular affect.
library(dplyr)
library(tidyr)
library(binom)
library(purrr)
library(ggplot2)
library(ggthemes)
I am using New Zealand hospital discharge data. The small csv file contains the variables:
crash_year - calendar year the accident took place in
selfowns - the number of accidents which were either the cyclist crashing into a stationary object, or a non-collision accident (so no other party involved) such as falling over
occurances - the total number of accidents, of which selfowns is a subset
This is real data, but this is also a graphing demonstration so the actual data is not the focus.
cyclists <- read.csv("hospital_cycle_discharges.csv")
threecolors <- c("#428543", "#77C579", "#DCEEA2")
First, a traditional, nuance free, graph.
cyclists %>% mutate(proportion_selfown = selfowns/occurances) %>%
ggplot(aes(x=crash_year, y=proportion_selfown)) + geom_bar(stat="identity") +
theme_tufte() + ylim(0,1) + ggtitle("Cyclist solo accidents as proportion of accidents")
Calculating out solid colour bands takes a bit of work (it could be approximated in easier ways, but I am doing the full work as the confidence intervals are not always symmetric around the point estimate with binomials). For this nest/models/unnest approach see the book R for Data Science, by Hadley Wickham, the many models section.
In short, I am calculating multiple binomial confidence thresholds for each entry.
thresholds <- function(df){
c1 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.68, method="exact")
c2 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.95, method="exact")
c3 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.997, method="exact")
df_out <- data.frame(
pt_est = c1$mean,
c1_lower = c1$lower,
c1_upper = c1$upper,
c2_lower = c2$lower,
c2_upper = c2$upper,
c3_lower = c3$lower,
c3_upper = c3$upper
)
return(df_out)
}
grf_band <- cyclists %>% group_by(crash_year) %>% nest() %>%
mutate(ci_range= map(data, thresholds)) %>% unnest(data,ci_range)
Next, I use those thresholds to draw rectangles
ggplot(grf_band) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=c3_lower,
xmax=crash_year + 0.5,
ymax=c3_upper),
fill=threecolors[3]) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=c2_lower,
xmax=crash_year + 0.5,
ymax=c2_upper),
fill=threecolors[2]) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=c1_lower,
xmax=crash_year + 0.5,
ymax=c1_upper),
fill=threecolors[1]) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=pt_est+0.001,
xmax=crash_year + 0.5,
ymax=pt_est-0.001),
fill="#000000") +
theme_tufte() + ylim(0,1) + ggtitle("Cyclist solo accidents as proportion of accidents")
I feel that using transparency needs a lot more work, for two reasons-
-
I see an inherent contradiction between how people expect fading out to work (as a smooth gradient over an area) and having the fade gradient be true to the data (as most of the probability mass is clustered around the point estimate). Of the two, I prefer to go with representing where the probabilities are, so need to put in quite a bit of work matching amount of fade out to location of probability mass, so creating an uneven fade.
-
Because there is transparency, you can't take the cheat I used in the previous graph of overlaying rectangles, you need to only draw the specific range, which means drawing two times the number of rectangles you want
thresholds <- function(df){
to10 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.1, method="exact")
to20 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.2, method="exact")
to30 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.3, method="exact")
to40 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.4, method="exact")
to50 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.5, method="exact")
to60 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.6, method="exact")
to70 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.7, method="exact")
to80 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.8, method="exact")
to90 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.9, method="exact")
to99.7 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.997, method="exact")
df_out <- data.frame(
pt_est = to10$mean,
p10_lower = to10$lower,
p10_upper = to10$upper,
p20_lower = to20$lower,
p20_upper = to20$upper,
p30_lower = to30$lower,
p30_upper = to30$upper,
p40_lower = to40$lower,
p40_upper = to40$upper,
p50_lower = to50$lower,
p50_upper = to50$upper,
p60_lower = to60$lower,
p60_upper = to60$upper,
p70_lower = to70$lower,
p70_upper = to70$upper,
p80_lower = to80$lower,
p80_upper = to80$upper,
p90_lower = to90$lower,
p90_upper = to90$upper,
p99.7_lower = to99.7$lower,
p99.7_upper = to99.7$upper
)
return(df_out)
}
grf_alpha <- cyclists %>% group_by(crash_year) %>% nest() %>%
mutate(ci_range= map(data, thresholds)) %>% unnest(data,ci_range)
The graphing is the same principles as before, with a lot more rectangles.
ggplot(grf_alpha) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p10_lower,
xmax=crash_year + 0.5,
ymax=pt_est),
fill=threecolors[1], alpha=1) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=pt_est,
xmax=crash_year + 0.5,
ymax=p10_upper),
fill=threecolors[1], alpha=1) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p20_lower,
xmax=crash_year + 0.5,
ymax=p10_lower),
fill=threecolors[1], alpha=0.9) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p10_upper,
xmax=crash_year + 0.5,
ymax=p20_upper),
fill=threecolors[1], alpha=0.9) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p30_lower,
xmax=crash_year + 0.5,
ymax=p20_lower),
fill=threecolors[1], alpha=0.8) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p20_upper,
xmax=crash_year + 0.5,
ymax=p30_upper),
fill=threecolors[1], alpha=0.8) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p40_lower,
xmax=crash_year + 0.5,
ymax=p30_lower),
fill=threecolors[1], alpha=0.7) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p30_upper,
xmax=crash_year + 0.5,
ymax=p40_upper),
fill=threecolors[1], alpha=0.7) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p50_lower,
xmax=crash_year + 0.5,
ymax=p40_lower),
fill=threecolors[1], alpha=0.6) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p40_upper,
xmax=crash_year + 0.5,
ymax=p50_upper),
fill=threecolors[1], alpha=0.6) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p60_lower,
xmax=crash_year + 0.5,
ymax=p50_lower),
fill=threecolors[1], alpha=0.5) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p50_upper,
xmax=crash_year + 0.5,
ymax=p60_upper),
fill=threecolors[1], alpha=0.5) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p70_lower,
xmax=crash_year + 0.5,
ymax=p60_lower),
fill=threecolors[1], alpha=0.4) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p60_upper,
xmax=crash_year + 0.5,
ymax=p70_upper),
fill=threecolors[1], alpha=0.4) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p80_lower,
xmax=crash_year + 0.5,
ymax=p70_lower),
fill=threecolors[1], alpha=0.3) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p70_upper,
xmax=crash_year + 0.5,
ymax=p80_upper),
fill=threecolors[1], alpha=0.3) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p90_lower,
xmax=crash_year + 0.5,
ymax=p80_lower),
fill=threecolors[1], alpha=0.2) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p80_upper,
xmax=crash_year + 0.5,
ymax=p90_upper),
fill=threecolors[1], alpha=0.2) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p99.7_lower,
xmax=crash_year + 0.5,
ymax=p90_lower),
fill=threecolors[1], alpha=0.1) +
geom_rect(aes(xmin=crash_year - 0.5,
ymin=p90_upper,
xmax=crash_year + 0.5,
ymax=p99.7_upper),
fill=threecolors[1], alpha=0.1) +
theme_tufte() + ylim(0,1) + ggtitle("Cyclist solo accidents as proportion of accidents")
Because I am making a lot of rectangles, I can cut down on the graphing code by transforming the data from data of observations with confidences interval thresholds, to data of confidence interval bands of point estimates. As this makes each confidence interval range a tidy observation, it is much easier to graph.
Developing this approach needs a certain amount of thinking out what data structure answers my need, and how to I change my data into that structure. In particular, because in this approach I am including the alpha value in the data, I need to be mindful that the thresholds values order linearly through the probabilities, but the alpha values rise and fall.
thresholds <- function(df){
to10 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.1, method="exact")
to20 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.2, method="exact")
to30 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.3, method="exact")
to40 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.4, method="exact")
to50 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.5, method="exact")
to60 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.6, method="exact")
to70 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.7, method="exact")
to80 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.8, method="exact")
to90 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.9, method="exact")
to99.7 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.997, method="exact")
df_out <- data.frame(
t01 = to99.7$lower,
t02 = to90$lower,
t03 = to80$lower,
t04 = to70$lower,
t05 = to60$lower,
t06 = to50$lower,
t07 = to40$lower,
t08 = to30$lower,
t09 = to20$lower,
t10 = to10$lower,
t11 = to10$mean,
t12 = to10$upper,
t13 = to20$upper,
t14 = to30$upper,
t15 = to40$upper,
t16 = to50$upper,
t17 = to60$upper,
t18 = to70$upper,
t19 = to80$upper,
t20 = to90$upper,
t21 = to99.7$upper )
return(df_out)
}
grf_alpha_tdy <- cyclists %>% group_by(crash_year) %>% nest() %>%
mutate(ci_range= map(data, thresholds)) %>% unnest(ci_range) %>%
select(-data) %>% ungroup() %>% gather(threshold, ythreshold1, t01:t21) %>%
group_by(crash_year) %>% arrange(threshold) %>%
mutate(ythreshold2 = lead(ythreshold1)) %>% filter(!is.na(ythreshold2)) %>%
mutate(alpha_value = c(1:10,10:1)/10) %>% ungroup() %>% arrange(crash_year, threshold)
Having transformed the data so that the observations are the things I need to achieve my goal, the graph is much easier
ggplot(grf_alpha_tdy, aes(xmin=crash_year - 0.5,
ymin=ythreshold1,
xmax=crash_year + 0.5,
ymax=ythreshold2,
alpha=alpha_value)) +
geom_rect(fill=threecolors[1]) +
theme_tufte() + ylim(0,1) +
ggtitle("Cyclist solo accidents as proportion of accidents") + theme(legend.position="none")