The banded_ci_demo from thoughtfulbloke

title

author

date

output

Banded or faded confidence intervals

David Hood

3/10/2018

html_document

word_document

keep_md
true

default

Confidence Intervals

I don't like confidence intervals as normally presented (though I acknowledge it is often for journal style/economics reasons outside of the author's control), so here is two alternatives: banded and fade-out, for when you have the freedom to do better justice to the data.

I initially began doing these with a mass of custom code in base plot. This is a ggplot2 version, involving a bit less code and a few more libraries. This is very much an "intermediate" R walkthrough of what I do, in as much as I am not explaining ggplot or data transformation, just showing how I do one particular affect.

library(dplyr)
library(tidyr)
library(binom)
library(purrr)
library(ggplot2)
library(ggthemes)

Example Data

I am using New Zealand hospital discharge data. The small csv file contains the variables:

crash_year - calendar year the accident took place in

selfowns - the number of accidents which were either the cyclist crashing into a stationary object, or a non-collision accident (so no other party involved) such as falling over

occurances - the total number of accidents, of which selfowns is a subset

This is real data, but this is also a graphing demonstration so the actual data is not the focus.

cyclists <- read.csv("hospital_cycle_discharges.csv")
threecolors <- c("#428543", "#77C579", "#DCEEA2")

A basic view

First, a traditional, nuance free, graph.

cyclists %>% mutate(proportion_selfown = selfowns/occurances) %>%
  ggplot(aes(x=crash_year, y=proportion_selfown)) + geom_bar(stat="identity") + 
  theme_tufte() + ylim(0,1) + ggtitle("Cyclist solo accidents as proportion of accidents")

A banded view

Calculating out solid colour bands takes a bit of work (it could be approximated in easier ways, but I am doing the full work as the confidence intervals are not always symmetric around the point estimate with binomials). For this nest/models/unnest approach see the book R for Data Science, by Hadley Wickham, the many models section.

In short, I am calculating multiple binomial confidence thresholds for each entry.

thresholds <- function(df){
  c1 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.68, method="exact")
  c2 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.95, method="exact")
  c3 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.997, method="exact")
  df_out <- data.frame(
    pt_est = c1$mean,
    c1_lower = c1$lower,
    c1_upper = c1$upper,
    c2_lower = c2$lower,
    c2_upper = c2$upper,
    c3_lower = c3$lower,
    c3_upper = c3$upper
  )
  return(df_out)
}

grf_band <- cyclists %>% group_by(crash_year) %>% nest() %>% 
  mutate(ci_range= map(data, thresholds)) %>% unnest(data,ci_range)

Next, I use those thresholds to draw rectangles

ggplot(grf_band) +
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=c3_lower,
                xmax=crash_year + 0.5,
                ymax=c3_upper),
            fill=threecolors[3]) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=c2_lower,
                xmax=crash_year + 0.5,
                ymax=c2_upper),
            fill=threecolors[2]) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=c1_lower,
                xmax=crash_year + 0.5,
                ymax=c1_upper),
            fill=threecolors[1]) + 
    geom_rect(aes(xmin=crash_year - 0.5,
                ymin=pt_est+0.001,
                xmax=crash_year + 0.5,
                ymax=pt_est-0.001),
            fill="#000000") + 
  theme_tufte() + ylim(0,1) + ggtitle("Cyclist solo accidents as proportion of accidents")

Using transparency

I feel that using transparency needs a lot more work, for two reasons-

I see an inherent contradiction between how people expect fading out to work (as a smooth gradient over an area) and having the fade gradient be true to the data (as most of the probability mass is clustered around the point estimate). Of the two, I prefer to go with representing where the probabilities are, so need to put in quite a bit of work matching amount of fade out to location of probability mass, so creating an uneven fade.
Because there is transparency, you can't take the cheat I used in the previous graph of overlaying rectangles, you need to only draw the specific range, which means drawing two times the number of rectangles you want

thresholds <- function(df){
  to10 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.1, method="exact")
  to20 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.2, method="exact")
  to30 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.3, method="exact")
  to40 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.4, method="exact")
  to50 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.5, method="exact")
  to60 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.6, method="exact")
  to70 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.7, method="exact")
  to80 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.8, method="exact")
  to90 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.9, method="exact")
  to99.7 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.997, method="exact")
  
  df_out <- data.frame(
    pt_est = to10$mean,
    p10_lower = to10$lower,
    p10_upper = to10$upper,
    p20_lower = to20$lower,
    p20_upper = to20$upper,
    p30_lower = to30$lower,
    p30_upper = to30$upper,
    p40_lower = to40$lower,
    p40_upper = to40$upper,
    p50_lower = to50$lower,
    p50_upper = to50$upper,
    p60_lower = to60$lower,
    p60_upper = to60$upper,
    p70_lower = to70$lower,
    p70_upper = to70$upper,
    p80_lower = to80$lower,
    p80_upper = to80$upper,
    p90_lower = to90$lower,
    p90_upper = to90$upper,
    p99.7_lower = to99.7$lower,
    p99.7_upper = to99.7$upper
  )
  return(df_out)
}

grf_alpha <- cyclists %>% group_by(crash_year) %>% nest() %>% 
  mutate(ci_range= map(data, thresholds)) %>% unnest(data,ci_range)

The graphing is the same principles as before, with a lot more rectangles.

ggplot(grf_alpha) +
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p10_lower,
                xmax=crash_year + 0.5,
                ymax=pt_est),
            fill=threecolors[1], alpha=1) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=pt_est,
                xmax=crash_year + 0.5,
                ymax=p10_upper),
            fill=threecolors[1], alpha=1) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p20_lower,
                xmax=crash_year + 0.5,
                ymax=p10_lower),
            fill=threecolors[1], alpha=0.9) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p10_upper,
                xmax=crash_year + 0.5,
                ymax=p20_upper),
            fill=threecolors[1], alpha=0.9) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p30_lower,
                xmax=crash_year + 0.5,
                ymax=p20_lower),
            fill=threecolors[1], alpha=0.8) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p20_upper,
                xmax=crash_year + 0.5,
                ymax=p30_upper),
            fill=threecolors[1], alpha=0.8) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p40_lower,
                xmax=crash_year + 0.5,
                ymax=p30_lower),
            fill=threecolors[1], alpha=0.7) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p30_upper,
                xmax=crash_year + 0.5,
                ymax=p40_upper),
            fill=threecolors[1], alpha=0.7) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p50_lower,
                xmax=crash_year + 0.5,
                ymax=p40_lower),
            fill=threecolors[1], alpha=0.6) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p40_upper,
                xmax=crash_year + 0.5,
                ymax=p50_upper),
            fill=threecolors[1], alpha=0.6) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p60_lower,
                xmax=crash_year + 0.5,
                ymax=p50_lower),
            fill=threecolors[1], alpha=0.5) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p50_upper,
                xmax=crash_year + 0.5,
                ymax=p60_upper),
            fill=threecolors[1], alpha=0.5) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p70_lower,
                xmax=crash_year + 0.5,
                ymax=p60_lower),
            fill=threecolors[1], alpha=0.4) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p60_upper,
                xmax=crash_year + 0.5,
                ymax=p70_upper),
            fill=threecolors[1], alpha=0.4) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p80_lower,
                xmax=crash_year + 0.5,
                ymax=p70_lower),
            fill=threecolors[1], alpha=0.3) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p70_upper,
                xmax=crash_year + 0.5,
                ymax=p80_upper),
            fill=threecolors[1], alpha=0.3) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p90_lower,
                xmax=crash_year + 0.5,
                ymax=p80_lower),
            fill=threecolors[1], alpha=0.2) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p80_upper,
                xmax=crash_year + 0.5,
                ymax=p90_upper),
            fill=threecolors[1], alpha=0.2) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p99.7_lower,
                xmax=crash_year + 0.5,
                ymax=p90_lower),
            fill=threecolors[1], alpha=0.1) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p90_upper,
                xmax=crash_year + 0.5,
                ymax=p99.7_upper),
            fill=threecolors[1], alpha=0.1) + 
  theme_tufte() + ylim(0,1)  + ggtitle("Cyclist solo accidents as proportion of accidents")

Transparency with data transformation

Because I am making a lot of rectangles, I can cut down on the graphing code by transforming the data from data of observations with confidences interval thresholds, to data of confidence interval bands of point estimates. As this makes each confidence interval range a tidy observation, it is much easier to graph.

Developing this approach needs a certain amount of thinking out what data structure answers my need, and how to I change my data into that structure. In particular, because in this approach I am including the alpha value in the data, I need to be mindful that the thresholds values order linearly through the probabilities, but the alpha values rise and fall.

thresholds <- function(df){
  to10 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.1, method="exact")
  to20 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.2, method="exact")
  to30 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.3, method="exact")
  to40 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.4, method="exact")
  to50 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.5, method="exact")
  to60 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.6, method="exact")
  to70 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.7, method="exact")
  to80 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.8, method="exact")
  to90 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.9, method="exact")
  to99.7 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.997, method="exact")
  
  df_out <- data.frame(
    t01 = to99.7$lower,
    t02 = to90$lower,
    t03 = to80$lower,
    t04 = to70$lower,
    t05 = to60$lower,
    t06 = to50$lower,
    t07 = to40$lower,
    t08 = to30$lower,
    t09 = to20$lower,
    t10 = to10$lower,
    t11 = to10$mean,
    t12 = to10$upper,
    t13 = to20$upper,
    t14 = to30$upper,
    t15 = to40$upper,
    t16 = to50$upper,
    t17 = to60$upper,   
    t18 = to70$upper,
    t19 = to80$upper,
    t20 = to90$upper,  
    t21 = to99.7$upper )
    return(df_out)
}

grf_alpha_tdy <- cyclists %>% group_by(crash_year) %>% nest() %>% 
  mutate(ci_range= map(data, thresholds)) %>% unnest(ci_range) %>%
  select(-data) %>% ungroup() %>% gather(threshold, ythreshold1, t01:t21) %>%
  group_by(crash_year) %>%  arrange(threshold) %>%
  mutate(ythreshold2 = lead(ythreshold1)) %>% filter(!is.na(ythreshold2)) %>% 
  mutate(alpha_value = c(1:10,10:1)/10) %>% ungroup() %>% arrange(crash_year, threshold)

Having transformed the data so that the observations are the things I need to achieve my goal, the graph is much easier

ggplot(grf_alpha_tdy, aes(xmin=crash_year - 0.5,
                ymin=ythreshold1,
                xmax=crash_year + 0.5,
                ymax=ythreshold2,
                alpha=alpha_value)) +
  geom_rect(fill=threecolors[1]) + 
  theme_tufte() + ylim(0,1)  +
  ggtitle("Cyclist solo accidents as proportion of accidents") + theme(legend.position="none")

thoughtfulbloke / banded_ci_demo Goto Github PK

banded_ci_demo's Introduction

Confidence Intervals

Example Data

A basic view

A banded view

Using transparency

Transparency with data transformation

banded_ci_demo's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent