Code Monkey home page Code Monkey logo

banded_ci_demo's Introduction

title author date output
Banded or faded confidence intervals
David Hood
3/10/2018
html_document word_document
keep_md
true
default

Confidence Intervals

I don't like confidence intervals as normally presented (though I acknowledge it is often for journal style/economics reasons outside of the author's control), so here is two alternatives: banded and fade-out, for when you have the freedom to do better justice to the data.

I initially began doing these with a mass of custom code in base plot. This is a ggplot2 version, involving a bit less code and a few more libraries. This is very much an "intermediate" R walkthrough of what I do, in as much as I am not explaining ggplot or data transformation, just showing how I do one particular affect.

library(dplyr)
library(tidyr)
library(binom)
library(purrr)
library(ggplot2)
library(ggthemes)

Example Data

I am using New Zealand hospital discharge data. The small csv file contains the variables:

crash_year - calendar year the accident took place in

selfowns - the number of accidents which were either the cyclist crashing into a stationary object, or a non-collision accident (so no other party involved) such as falling over

occurances - the total number of accidents, of which selfowns is a subset

This is real data, but this is also a graphing demonstration so the actual data is not the focus.

cyclists <- read.csv("hospital_cycle_discharges.csv")
threecolors <- c("#428543", "#77C579", "#DCEEA2")

A basic view

First, a traditional, nuance free, graph.

cyclists %>% mutate(proportion_selfown = selfowns/occurances) %>%
  ggplot(aes(x=crash_year, y=proportion_selfown)) + geom_bar(stat="identity") + 
  theme_tufte() + ylim(0,1) + ggtitle("Cyclist solo accidents as proportion of accidents")

A banded view

Calculating out solid colour bands takes a bit of work (it could be approximated in easier ways, but I am doing the full work as the confidence intervals are not always symmetric around the point estimate with binomials). For this nest/models/unnest approach see the book R for Data Science, by Hadley Wickham, the many models section.

In short, I am calculating multiple binomial confidence thresholds for each entry.

thresholds <- function(df){
  c1 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.68, method="exact")
  c2 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.95, method="exact")
  c3 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.997, method="exact")
  df_out <- data.frame(
    pt_est = c1$mean,
    c1_lower = c1$lower,
    c1_upper = c1$upper,
    c2_lower = c2$lower,
    c2_upper = c2$upper,
    c3_lower = c3$lower,
    c3_upper = c3$upper
  )
  return(df_out)
}

grf_band <- cyclists %>% group_by(crash_year) %>% nest() %>% 
  mutate(ci_range= map(data, thresholds)) %>% unnest(data,ci_range)

Next, I use those thresholds to draw rectangles

ggplot(grf_band) +
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=c3_lower,
                xmax=crash_year + 0.5,
                ymax=c3_upper),
            fill=threecolors[3]) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=c2_lower,
                xmax=crash_year + 0.5,
                ymax=c2_upper),
            fill=threecolors[2]) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=c1_lower,
                xmax=crash_year + 0.5,
                ymax=c1_upper),
            fill=threecolors[1]) + 
    geom_rect(aes(xmin=crash_year - 0.5,
                ymin=pt_est+0.001,
                xmax=crash_year + 0.5,
                ymax=pt_est-0.001),
            fill="#000000") + 
  theme_tufte() + ylim(0,1) + ggtitle("Cyclist solo accidents as proportion of accidents")

Using transparency

I feel that using transparency needs a lot more work, for two reasons-

  • I see an inherent contradiction between how people expect fading out to work (as a smooth gradient over an area) and having the fade gradient be true to the data (as most of the probability mass is clustered around the point estimate). Of the two, I prefer to go with representing where the probabilities are, so need to put in quite a bit of work matching amount of fade out to location of probability mass, so creating an uneven fade.

  • Because there is transparency, you can't take the cheat I used in the previous graph of overlaying rectangles, you need to only draw the specific range, which means drawing two times the number of rectangles you want

thresholds <- function(df){
  to10 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.1, method="exact")
  to20 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.2, method="exact")
  to30 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.3, method="exact")
  to40 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.4, method="exact")
  to50 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.5, method="exact")
  to60 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.6, method="exact")
  to70 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.7, method="exact")
  to80 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.8, method="exact")
  to90 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.9, method="exact")
  to99.7 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.997, method="exact")
  
  df_out <- data.frame(
    pt_est = to10$mean,
    p10_lower = to10$lower,
    p10_upper = to10$upper,
    p20_lower = to20$lower,
    p20_upper = to20$upper,
    p30_lower = to30$lower,
    p30_upper = to30$upper,
    p40_lower = to40$lower,
    p40_upper = to40$upper,
    p50_lower = to50$lower,
    p50_upper = to50$upper,
    p60_lower = to60$lower,
    p60_upper = to60$upper,
    p70_lower = to70$lower,
    p70_upper = to70$upper,
    p80_lower = to80$lower,
    p80_upper = to80$upper,
    p90_lower = to90$lower,
    p90_upper = to90$upper,
    p99.7_lower = to99.7$lower,
    p99.7_upper = to99.7$upper
  )
  return(df_out)
}

grf_alpha <- cyclists %>% group_by(crash_year) %>% nest() %>% 
  mutate(ci_range= map(data, thresholds)) %>% unnest(data,ci_range)

The graphing is the same principles as before, with a lot more rectangles.

ggplot(grf_alpha) +
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p10_lower,
                xmax=crash_year + 0.5,
                ymax=pt_est),
            fill=threecolors[1], alpha=1) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=pt_est,
                xmax=crash_year + 0.5,
                ymax=p10_upper),
            fill=threecolors[1], alpha=1) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p20_lower,
                xmax=crash_year + 0.5,
                ymax=p10_lower),
            fill=threecolors[1], alpha=0.9) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p10_upper,
                xmax=crash_year + 0.5,
                ymax=p20_upper),
            fill=threecolors[1], alpha=0.9) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p30_lower,
                xmax=crash_year + 0.5,
                ymax=p20_lower),
            fill=threecolors[1], alpha=0.8) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p20_upper,
                xmax=crash_year + 0.5,
                ymax=p30_upper),
            fill=threecolors[1], alpha=0.8) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p40_lower,
                xmax=crash_year + 0.5,
                ymax=p30_lower),
            fill=threecolors[1], alpha=0.7) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p30_upper,
                xmax=crash_year + 0.5,
                ymax=p40_upper),
            fill=threecolors[1], alpha=0.7) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p50_lower,
                xmax=crash_year + 0.5,
                ymax=p40_lower),
            fill=threecolors[1], alpha=0.6) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p40_upper,
                xmax=crash_year + 0.5,
                ymax=p50_upper),
            fill=threecolors[1], alpha=0.6) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p60_lower,
                xmax=crash_year + 0.5,
                ymax=p50_lower),
            fill=threecolors[1], alpha=0.5) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p50_upper,
                xmax=crash_year + 0.5,
                ymax=p60_upper),
            fill=threecolors[1], alpha=0.5) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p70_lower,
                xmax=crash_year + 0.5,
                ymax=p60_lower),
            fill=threecolors[1], alpha=0.4) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p60_upper,
                xmax=crash_year + 0.5,
                ymax=p70_upper),
            fill=threecolors[1], alpha=0.4) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p80_lower,
                xmax=crash_year + 0.5,
                ymax=p70_lower),
            fill=threecolors[1], alpha=0.3) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p70_upper,
                xmax=crash_year + 0.5,
                ymax=p80_upper),
            fill=threecolors[1], alpha=0.3) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p90_lower,
                xmax=crash_year + 0.5,
                ymax=p80_lower),
            fill=threecolors[1], alpha=0.2) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p80_upper,
                xmax=crash_year + 0.5,
                ymax=p90_upper),
            fill=threecolors[1], alpha=0.2) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p99.7_lower,
                xmax=crash_year + 0.5,
                ymax=p90_lower),
            fill=threecolors[1], alpha=0.1) + 
  geom_rect(aes(xmin=crash_year - 0.5,
                ymin=p90_upper,
                xmax=crash_year + 0.5,
                ymax=p99.7_upper),
            fill=threecolors[1], alpha=0.1) + 
  theme_tufte() + ylim(0,1)  + ggtitle("Cyclist solo accidents as proportion of accidents")

Transparency with data transformation

Because I am making a lot of rectangles, I can cut down on the graphing code by transforming the data from data of observations with confidences interval thresholds, to data of confidence interval bands of point estimates. As this makes each confidence interval range a tidy observation, it is much easier to graph.

Developing this approach needs a certain amount of thinking out what data structure answers my need, and how to I change my data into that structure. In particular, because in this approach I am including the alpha value in the data, I need to be mindful that the thresholds values order linearly through the probabilities, but the alpha values rise and fall.

thresholds <- function(df){
  to10 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.1, method="exact")
  to20 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.2, method="exact")
  to30 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.3, method="exact")
  to40 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.4, method="exact")
  to50 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.5, method="exact")
  to60 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.6, method="exact")
  to70 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.7, method="exact")
  to80 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.8, method="exact")
  to90 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.9, method="exact")
  to99.7 <- binom.confint(df$selfowns[1], df$occurances[1], conf.level = 0.997, method="exact")
  
  df_out <- data.frame(
    t01 = to99.7$lower,
    t02 = to90$lower,
    t03 = to80$lower,
    t04 = to70$lower,
    t05 = to60$lower,
    t06 = to50$lower,
    t07 = to40$lower,
    t08 = to30$lower,
    t09 = to20$lower,
    t10 = to10$lower,
    t11 = to10$mean,
    t12 = to10$upper,
    t13 = to20$upper,
    t14 = to30$upper,
    t15 = to40$upper,
    t16 = to50$upper,
    t17 = to60$upper,   
    t18 = to70$upper,
    t19 = to80$upper,
    t20 = to90$upper,  
    t21 = to99.7$upper )
    return(df_out)
}

grf_alpha_tdy <- cyclists %>% group_by(crash_year) %>% nest() %>% 
  mutate(ci_range= map(data, thresholds)) %>% unnest(ci_range) %>%
  select(-data) %>% ungroup() %>% gather(threshold, ythreshold1, t01:t21) %>%
  group_by(crash_year) %>%  arrange(threshold) %>%
  mutate(ythreshold2 = lead(ythreshold1)) %>% filter(!is.na(ythreshold2)) %>% 
  mutate(alpha_value = c(1:10,10:1)/10) %>% ungroup() %>% arrange(crash_year, threshold)

Having transformed the data so that the observations are the things I need to achieve my goal, the graph is much easier

ggplot(grf_alpha_tdy, aes(xmin=crash_year - 0.5,
                ymin=ythreshold1,
                xmax=crash_year + 0.5,
                ymax=ythreshold2,
                alpha=alpha_value)) +
  geom_rect(fill=threecolors[1]) + 
  theme_tufte() + ylim(0,1)  +
  ggtitle("Cyclist solo accidents as proportion of accidents") + theme(legend.position="none")

banded_ci_demo's People

Contributors

thoughtfulbloke avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.