rudeboybert / fivethirtyeight

R package of data and code behind the stories and interactives at FiveThirtyEight

Home Page: https://fivethirtyeight-r.netlify.app/

License: Other

R 55.10% Python 6.69% Jupyter Notebook 36.05% TeX 2.16%

cran data-science datajournalism fivethirtyeight r rpackage statistics

fivethirtyeight's Introduction

Albert Y. Kim's Personal Webpage

Available at http://rudeboybert.rbind.io/.

Built using RStudio's blogdown package and the Hugo Academic theme.
Deployed to Netlify. Yihui Xie's blog post convinced me to switch from GitHub Pages to Netlify.
I use a different domain name than the one provided by Netlify. I opted for a free rbind.io subdomain offered by RStudio. Note this involved filing a GitHub issue.

The above is based on Alison Presmanes Hill's tutorial; take a look to get up and running quickly.

fivethirtyeight's People

irichgreen haozhu233 spencerx deepan04 taawwmm avontd2868 jd2504 cb4ds chirayukong iamkbpark hjl2014 ghosthamlet jasdumas hbcbh1999 pjbitterman safouanio mustafaascha asrosenberg lebebr01 nnsc-ott-k erpansing jonathanbouchet frankhu1089 adhok rlugojr vovietanh brooke-watson kartechbabu ismayc byuidatascience nsonneborn oleksiyanokhin patriciagoresen haidara1 radovankavicky gapdata mmanley18 maggieshea ecantu75 jpolzer jyterencekim joeklieg deerluluolivia supremumk tonydurst kashenfelter sam-blake yanliangs bdelahoussaye anandrhub davidsgrogan fishershifei kjureksmu cherishashby joeflorence jduras sam5200 laelias7 atiakor0 sydfox sgassefa starryz elaineyex rachyan nliced wangyanxumu jormacmoo niannucci ed4ubenspeck sunniraleigh kvanallen ranawg lcarpenter20 abarylsky danicamiguel jkeast aballou16 janebang ireneryan fatimak98 roroceesay davan690 azizsurani seren-smith karthy257 brendanhartnett mariumtapal fototo vinnetou jaclyndangelo officialbhattacharya seiyu32 krishnadilips ashmanmalhotra ouyang72 evelyneds siyingzhg shawnkoh manohar234 1996sushri

fivethirtyeight's Issues

Have function to install {fivethirtyeightdata} package in {fivethirtyeight}

While it is easy to run install.packages('fivethirtyeightdata', repos = 'https://fivethirtyeightdata.github.io/drat/', type = 'source'), it would be simpler to have something like a get_larger_datasets() function, and to show a message after people run library(fivethirtyeight) letting them know about it. The function could simply be a wrapper:

get_larger_datasets <- function() {
    install.packages(
        'fivethirtyeightdata', 
        repos = 'https://fivethirtyeightdata.github.io/drat/', 
        type = 'source'
    )
    message("Remember to use `library(fivethirtyeightdata)` to retrieve any of these larger datasets.")
}
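The startup message mentioned above could be sketched roughly as follows. This is only an illustration: it assumes the hook would live in a file such as R/zzz.R, and get_larger_datasets() is the wrapper proposed in this issue, not a function that already exists in the package.

```r
# Sketch of a startup hint for the package (hypothetical, e.g. in R/zzz.R).
# get_larger_datasets() is the wrapper proposed in this issue.
.onAttach <- function(libname, pkgname) {
  # Only nudge users who do not already have the companion package installed
  if (!requireNamespace("fivethirtyeightdata", quietly = TRUE)) {
    packageStartupMessage(
      "Some larger datasets live in {fivethirtyeightdata}. ",
      "Run get_larger_datasets() to install them."
    )
  }
}
```

Using packageStartupMessage() rather than message() lets users silence the hint with suppressPackageStartupMessages().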

starry's several issues

?dataset does not respond again!
mueller_approval_polls unclear column: population (2 factors: rv & a) -- wrote issue in 538
I really don't understand what the columns mean in ncaa_womens_basketball_tournament_history. GOSH. SPORTS. NEED HELP.

Error college-recent-grads?

There seems to be something wrong with the college_recent_grads file but I couldn't yet figure out where the issue is stemming from to fix it. Perhaps someone more familiar with the package can pinpoint it quicker than me?

Or if I'm making a mistake somewhere I'd appreciate a hint, thank you!

Here are my three reprexes for comparison:

Load the data from the package:

suppressPackageStartupMessages(library(tidyverse))
library(fivethirtyeight)

college_recent_grads %>%
  arrange(unemployment_rate) %>%
  select(major, sharewomen, unemployment_rate, sample_size, men, women) %>%
  head(5)
#> # A tibble: 5 x 6
#>                                        major sharewomen unemployment_rate
#>                                        <chr>      <dbl>             <dbl>
#> 1           Mathematics And Computer Science  0.9278072       0.000000000
#> 2                                     Botany  0.5289691       0.000000000
#> 3                               Soil Science  0.7644265       0.000000000
#> 4 Educational Administration And Supervision  0.4487323       0.000000000
#> 5  Engineering Mechanics Physics And Science  0.1839852       0.006334343
#> # ... with 3 more variables: sample_size <int>, men <int>, women <int>

When I saw this I didn't believe Math & CS could have 92% women, so I looked into it a bit more.

Load the data from the rda file in the package: I downloaded college_recent_grads.rda and ran the same code.

suppressPackageStartupMessages(library(tidyverse))
load("~/Desktop/college-grads/data/college_recent_grads.rda")

college_recent_grads %>%
  arrange(unemployment_rate) %>%
  select(major, sharewomen, unemployment_rate, sample_size, men, women) %>%
  head(5)
#> # A tibble: 5 x 6
#>                                        major sharewomen unemployment_rate
#>                                        <chr>      <dbl>             <dbl>
#> 1           Mathematics And Computer Science  0.1789819                 0
#> 2                      Military Technologies  0.0000000                 0
#> 3                                     Botany  0.5289691                 0
#> 4                               Soil Science  0.3051095                 0
#> 5 Educational Administration And Supervision  0.6517413                 0
#> # ... with 3 more variables: sample_size <int>, men <int>, women <int>

Note that the sharewomen values are different for Mathematics And Computer Science between the two outputs.

Load the data from 538's repo:

suppressPackageStartupMessages(library(tidyverse))
data_from_538 <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv")
#> Parsed with column specification:
#> cols(
#>   .default = col_integer(),
#>   Major = col_character(),
#>   Major_category = col_character(),
#>   ShareWomen = col_double(),
#>   Unemployment_rate = col_double()
#> )
#> See spec(...) for full column specifications.

data_from_538 %>%
  arrange(Unemployment_rate) %>%
  select(Major, ShareWomen, Unemployment_rate, Sample_size, Men, Women) %>%
  head(5)
#> # A tibble: 5 x 6
#>                                        Major ShareWomen Unemployment_rate
#>                                        <chr>      <dbl>             <dbl>
#> 1           MATHEMATICS AND COMPUTER SCIENCE  0.1789819                 0
#> 2                      MILITARY TECHNOLOGIES  0.0000000                 0
#> 3                                     BOTANY  0.5289691                 0
#> 4                               SOIL SCIENCE  0.3051095                 0
#> 5 EDUCATIONAL ADMINISTRATION AND SUPERVISION  0.6517413                 0
#> # ... with 3 more variables: Sample_size <int>, Men <int>, Women <int>

This matches the .rda file but not the file loaded when the package is loaded.
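One way to find which rows are affected is to recompute the share from the counts and compare. The sketch below assumes sharewomen is meant to equal women / (men + women); the toy rows use made-up counts, not the real figures, and flag_share_mismatch() is a hypothetical helper.

```r
# Toy rows with made-up counts (NOT the real figures), only to show the check
grads <- data.frame(
  major      = c("Major A", "Major B", "Major C"),
  men        = c(800L, 300L, 0L),
  women      = c(200L, 700L, 0L),
  sharewomen = c(0.2, 0.9, NA)   # the 0.9 is deliberately inconsistent
)

# Hypothetical helper: recompute the share and flag rows that disagree
flag_share_mismatch <- function(df, tol = 1e-6) {
  recomputed <- df$women / (df$men + df$women)   # NaN when both counts are 0
  mismatch <- abs(recomputed - df$sharewomen) > tol
  df$recomputed <- recomputed
  df$mismatch <- !is.na(mismatch) & mismatch     # treat missing shares as "not flagged"
  df
}

checked <- flag_share_mismatch(grads)
checked[checked$mismatch, c("major", "sharewomen", "recomputed")]
```

Running the same helper on the packaged college_recent_grads and on the .rda version would show whether the discrepancy is limited to a few majors or shifts a whole column.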

JSS DOI links

Dear Albert Y. Kim,

The Journal of Statistical Software (JSS, https://www.jstatsoft.org/) recently migrated to a new server and editorial system, resulting in a change of the URLs being used for publications. Hence we checked all CRAN packages using JSS URLs in the documentation or citation files etc. This includes some of your packages: fivethirtyeight.

In general we recommend to use DOIs instead of URLs to link to JSS publications. These use the following pattern for articles: 10.18637/jss.vXXX.iYY where XXX is the three-digit volume and YY the two-digit issue. (For code snippets a "cYY" instead of "iYY" is used.) The DOIs are also shown on the web pages of the JSS articles.

For including these in a package you typically use:

\doi{...} markup in .Rd files
doi:... in DESCRIPTION/Description fields
bibentry(..., doi = ...) in CITATION files (or citEntry)

We would recommend to change all JSS references in your package correspondingly (even if redirections for the URLs are still working). The corresponding files in the package are:
fivethirtyeight/inst/doc/tame.html

Thanks for your consideration - and for referring to work published in JSS!
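The DOI pattern described in the letter can be written as a small helper. jss_doi() is a hypothetical name used for illustration, and volume 59, issue 10 is just an example input.

```r
# Hypothetical helper that builds a JSS DOI from the pattern 10.18637/jss.vXXX.iYY
jss_doi <- function(volume, issue, type = c("article", "code_snippet")) {
  type <- match.arg(type)
  # Articles use "i", code snippets use "c"
  letter <- if (type == "article") "i" else "c"
  # Three-digit zero-padded volume, two-digit zero-padded issue
  sprintf("10.18637/jss.v%03d.%s%02d", as.integer(volume), letter, as.integer(issue))
}

jss_doi(59, 10)
#> [1] "10.18637/jss.v059.i10"
```

The resulting string is what would go inside \doi{...} in an .Rd file or the doi argument of bibentry().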

MRO 3.3.2

https://mran.microsoft.com/package/fivethirtyeight/ lists that you need R release > 3.2.4

I am running 3.3.2 (MRO) and get a warning that your package is not available for 3.3.2

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] lubridate_1.6.0 dplyr_0.5.0 purrr_0.2.2 readr_1.0.0 tidyr_0.6.0 tibble_1.2
[7] ggplot2_2.1.0 tidyverse_1.0.0 RevoUtilsMath_10.0.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.7 assertthat_0.1 R6_2.2.0 grid_3.3.2 plyr_1.8.4 DBI_0.5-1 gtable_0.2.0 magrittr_1.5
[9] scales_0.4.0 stringi_1.1.2 RevoUtils_10.0.2 tools_3.3.2 stringr_1.1.0 munsell_0.4.3 colorspace_1.2-7

install.packages("fivethirtyeight")
Warning in install.packages :
  package ‘fivethirtyeight’ is not available (for R version 3.3.2)

Further data "taming" checklist

Add year_bins (5-year bins) to bechdel so students can recreate the 538 stacked bar plot without having to use cut_n()?
Merge state.abb state abbreviations into hate_crimes so we can easily plot points with geom_text(..., label = state_abbrev)
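Both checklist items can be sketched in base R. The data frames below are toy stand-ins rather than the real datasets; only the column names state and year are assumed, and the 1970 to 2015 break range is an illustrative choice.

```r
# Toy stand-in for hate_crimes; only a `state` column with full names is assumed
hate_crimes_toy <- data.frame(
  state = c("Alabama", "Alaska", "District of Columbia"),
  stringsAsFactors = FALSE
)
# state.name and state.abb ship with base R and cover the 50 states only
hate_crimes_toy$state_abbrev <- state.abb[match(hate_crimes_toy$state, state.name)]
# DC is not in state.name, so it has to be patched by hand
hate_crimes_toy$state_abbrev[hate_crimes_toy$state == "District of Columbia"] <- "DC"

# Toy stand-in for bechdel; only a `year` column is assumed
bechdel_toy <- data.frame(year = c(1970L, 1974L, 1975L, 2013L))
bechdel_toy$year_bins <- cut(
  bechdel_toy$year,
  breaks = seq(1970, 2015, by = 5),
  right = FALSE,   # bins are [1970,1975), [1975,1980), ...
  dig.lab = 4      # print 1970 rather than 1.97e+03 in the labels
)
```

With the bins precomputed, students could map year_bins straight to the x aesthetic without a binning step of their own.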

Replace all tidyr::gather() code in help files with pivot_longer()

Some of the help/man files have roxygen2 examples that convert data as originally given in 538's GitHub to "tidy" format using tidyr::gather(). However, tidyr now includes much more intuitively-named and easier to use functions pivot_longer() and pivot_wider(). See this tidyverse.org blog post for more information.

In the case of fivethirtyeight package, we should replace all gather() code with pivot_longer(). For example, if you run ?drinks you'll currently see in the examples:

library(dplyr)
library(tidyr)
library(stringr)
drinks_tidy <- drinks %>%
  gather(type, servings, -c(country, total_litres_of_pure_alcohol)) %>%
  mutate(
    type = str_sub(type, start=1, end=-10)
  ) %>%
  arrange(country, type)

This needs to be updated with:

library(fivethirtyeight)
library(dplyr)
library(tidyr)
library(stringr)
drinks_tidy <- drinks %>%
  pivot_longer(cols = ends_with("servings"), names_to = "type", values_to = "servings") %>%
  mutate(type = str_sub(type, start=1, end=-10)) %>%
  arrange(country, type)

A search of gather in all R/data_X.R files will locate the roxygen2 code for all such cases. For example, the roxygen2 code for drinks is in data_albert.R.

@beanumber Let me know if you have someone who can do this, and I'll explicitly make them an assignee.

Release steps

Preparing CRAN release

Make sure that all datasets listed when running data(package = "fivethirtyeight") are listed in master Google Sheet of datasets and in list of all datasets on package webpage.
Check if any of the datasets in master Google Sheet of datasets with "DYNAMIC DATA THAT GETS UPDATED?" marked "Y" need to be updated by re-running appropriate data-raw/process_data_sets_X.R files
Edit version number in DESCRIPTION and NEWS.md.

The following steps ensure that all user-contributed vignettes are not included in the package on CRAN, but rather only in the development version on GitHub and on the package webpage:

Temporarily remove all .Rmd vignettes from vignettes/ except for fivethirtyeight.Rmd (includes a detailed list of all data), tame.Rmd (TISE article), and user_contributed_vignettes.Rmd (how to access user-contributed vignettes).
Then clear all non-source files from vignettes/, in other words, all .html and .R files, so that they don't interfere with CRAN submission.
Temporarily edit packages in DESCRIPTION to reflect above vignette changes. This will keep package dependency bloat down:
- Keep the following packages needed for all @examples in help files and the fivethirtyeight, tame, and user_contributed_vignettes vignettes. As of v0.5.0, these include: ggplot2, dplyr, tidyr, curl, readr, tibble, lubridate, stringr, janitor, knitr, gridExtra, ggthemes, scales, broom, magrittr, rmarkdown
- Temporarily remove the following packages used in all other vignettes. As of v0.5.0, these include: slam (>= 0.1-42), highcharter (>= 0.7), tidytext, textdata, hunspell, fmsb, wordcloud, corrplot, ggraph, igraph

Standard package steps:

Do one final check
Run the following devtools functions: spell_check(), check_rhub(), check_win_devel(), and check_win_release(). You'll eventually be asked if you ran them when publishing release via devtools::release()
Update cran-comments.md.

After CRAN release

Tag version in GitHub releases
Return temporarily removed vignettes
Return packages removed from DESCRIPTION
Edit version number in DESCRIPTION and NEWS.md

csv's in README rather than data-raw folder

I wanted to convert some data files from the data-raw folder into .rda files so more data will be available to users of the package, but I noticed that some of the data is only in the README.md file, and not in the actual data-raw folder itself. I was wondering why this is the case and whether the way to extract them is the same?

example: nba-forecasts

Issue with highcharter in biopics.Rmd

Hi @adhok, I hope you've been well. There seems to be an issue in your biopics.Rmd vignette, specifically lines 127 and 146 where you call highcharter::hchart(). I'm getting the error

Error in mutate_impl(.data, dots) : Column `x` is of unsupported type quoted call

For now, so that the package compiles, I've:

Commented out library(highcharter)
Set eval=FALSE for the above two code chunks

Could you make a pull request fixing the above bug? Thanks!

Tidy Data Principles for Intro Stats & Data Science Courses: The `fivethirtyeight` R Package

Hi @hadley! First off, thanks for looking at our package and making pull requests #1 and #2. We're honoured that you're watching!

@chesterismay, @jchunn, and I are working on a Technology Innovations in Statistics Education (TISE) paper describing the fivethirtyeight package: Tidy Data Principles for Intro Stats & Data Science Courses: The fivethirtyeight R Package.

The outline is in fivethirtyeight/TISE. We would love your input, in particular with regard to our proposal, outlined in the "The Proposal" section.

Any advice/input you could offer would be much appreciated.

Bachelorette Dataset

Looking at the original data on 538's GitHub, it seems the variables in the R package are encoded wrong.

For example in season 13 of the Bachelorette, everyone that was eliminated on night 1 has a 2 next to their name in the elimination_1 column – not clear why it is a 2. Also, I’m not seeing an R1 anywhere for Bryan A who got the first impression rose.

Add citation

Add citation() to upcoming "Tame Data" principles article.

Steps to take before CRAN update

Before running the standard devtools::release() steps, take the following steps relating to the master list of datasets in this Google Sheet:

Update the Google Sheet with all new datasets
Update any of the datasets in the Google Sheet with "DYNAMIC DATA THAT GETS UPDATED?" marked "Y". Do this by re-running appropriate lines in data-raw/process_data_sets.R
Re-run data-raw/rebuild_master_dataset_list.R to rebuild the datasets_master data object used in vignettes/fivethirtyeight.Rmd
Ensure that the output of data(package = "fivethirtyeight") matches datasets_master
Check rhub via devtools::check_rhub(env_vars=c(R_COMPILE_AND_INSTALL_PACKAGES = "always"))
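The comparison between data(package = "fivethirtyeight") and the master list could be automated along these lines. datasets_not_listed() is a hypothetical helper, and listed stands in for the dataset names from the Google Sheet or datasets_master, whose column layout is not shown here. The demo runs on base R's own datasets package so that it works without fivethirtyeight installed.

```r
# Hypothetical helper: which datasets in a package are missing from a given list?
datasets_not_listed <- function(pkg, listed) {
  # data(package = ...) returns a character matrix with an "Item" column
  items <- data(package = pkg)$results[, "Item"]
  # Items can look like "beaver1 (beavers)"; keep only the object name
  items <- sub(" \\(.*\\)$", "", items)
  setdiff(items, listed)
}

# Demo on base R's `datasets` package with a deliberately incomplete list
not_listed <- datasets_not_listed("datasets", listed = c("mtcars", "iris"))
head(not_listed)
```

For a release, an empty result from datasets_not_listed("fivethirtyeight", listed) would mean every packaged dataset appears in the master list.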

Table of appropriate analyses

Maybe a good task for a student? Create a table of all of the datasets and the types of problems each dataset would be good for:

Descriptive (plots, summary stats)
Inference (specify types of variables included and how many)
Modeling (regression, multiple regression, etc.)

New datasets

Not an issue per se, but it would be helpful if NEWS.md listed the new datasets in each release.
Thanks!

Dummy variable coefficients for ordered factors come out wonky in regression tables

Might need to convert all ordered = TRUE factors to unordered.

suppressPackageStartupMessages(library(tidyverse))
library(fivethirtyeight)
library(moderndive)

# clean_test is ordered factor
bechdel$clean_test[1:5]
#> [1] notalk ok     notalk notalk men   
#> Levels: nowomen < notalk < men < dubious < ok

# weird output for dummy variables in regression table
lm(domgross~clean_test, data = bechdel) %>% 
  get_regression_table()
#> Warning: package 'bindrcpp' was built under R version 3.4.4
#> # A tibble: 5 x 7
#>   term           estimate std_error statistic p_value   conf_low conf_high
#>   <chr>             <dbl>     <dbl>     <dbl>   <dbl>      <dbl>     <dbl>
#> 1 intercept     70491451.  2412903.    29.2    0.      65759017. 75223886.
#> 2 clean_test.L  -3804412.  5244171.    -0.725  0.468  -14089823.  6480999.
#> 3 clean_test.Q  -9916737.  5398245.    -1.84   0.0660 -20504334.   670861.
#> 4 clean_test.C    750590.  5353012.     0.140  0.889   -9748292. 11249471.
#> 5 clean_test^4  -9507099.  5580759.    -1.70   0.0890 -20452662.  1438464.
lm(domgross~clean_test, data = bechdel) %>% 
  summary()
#> 
#> Call:
#> lm(formula = domgross ~ clean_test, data = bechdel)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -79345264 -52417225 -25031559  22438057 691533349 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  70491451    2412903  29.214   <2e-16 ***
#> clean_test.L -3804412    5244171  -0.725   0.4683    
#> clean_test.Q -9916736    5398245  -1.837   0.0664 .  
#> clean_test.C   750590    5353012   0.140   0.8885    
#> clean_test^4 -9507099    5580759  -1.704   0.0886 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 80100000 on 1772 degrees of freedom
#>   (17 observations deleted due to missingness)
#> Multiple R-squared:  0.008974,   Adjusted R-squared:  0.006737 
#> F-statistic: 4.012 on 4 and 1772 DF,  p-value: 0.003042

# should look like
bechdel %>% 
  mutate(clean_test = factor(clean_test, ordered = FALSE)) %>% 
  lm(domgross~clean_test, data = .) %>% 
  get_regression_table()
#> # A tibble: 5 x 7
#>   term            estimate std_error statistic p_value  conf_low conf_high
#>   <chr>              <dbl>     <dbl>     <dbl>   <dbl>     <dbl>     <dbl>
#> 1 intercept         6.62e7  6793664.     9.75   0.        5.29e7 79547620.
#> 2 clean_testnot…    1.31e7  7663750.     1.72   0.0870   -1.89e6 28172609.
#> 3 clean_testmen     2.75e6  8910344.     0.309  0.758    -1.47e7 20226986.
#> 4 clean_testdub…    9.79e6  9573562.     1.02   0.307    -8.99e6 28562779.
#> 5 clean_testok     -4.34e6  7364354.    -0.589  0.556    -1.88e7 10106207.
bechdel %>% 
  mutate(clean_test = factor(clean_test, ordered = FALSE)) %>% 
  lm(domgross~clean_test, data = .) %>% 
  summary()
#> 
#> Call:
#> lm(formula = domgross ~ clean_test, data = .)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -79345264 -52417225 -25031559  22438057 691533349 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)       66223181    6793664   9.748   <2e-16 ***
#> clean_testnotalk  13141668    7663750   1.715   0.0866 .  
#> clean_testmen      2751095    8910344   0.309   0.7575    
#> clean_testdubious  9786117    9573562   1.022   0.3068    
#> clean_testok      -4337528    7364354  -0.589   0.5559    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 80100000 on 1772 degrees of freedom
#>   (17 observations deleted due to missingness)
#> Multiple R-squared:  0.008974,   Adjusted R-squared:  0.006737 
#> F-statistic: 4.012 on 4 and 1772 DF,  p-value: 0.003042

Created on 2018-04-04 by the reprex package (v0.2.0).
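The .L, .Q, .C and ^4 terms appear because R's default contrasts option uses polynomial contrasts (contr.poly) for ordered factors and treatment contrasts (contr.treatment) for unordered ones. A minimal sketch on simulated toy data (not the bechdel dataset) shows the same behaviour and confirms that only the parameterisation changes, not the fit:

```r
set.seed(1)
# Simulated toy data with a three-level ordered factor
toy <- data.frame(
  y = rnorm(30),
  f = factor(rep(c("low", "mid", "high"), times = 10),
             levels = c("low", "mid", "high"), ordered = TRUE)
)

getOption("contrasts")   # default: contr.treatment (unordered), contr.poly (ordered)

# Ordered factor: linear and quadratic polynomial terms
fit_ordered <- lm(y ~ f, data = toy)
names(coef(fit_ordered))
#> [1] "(Intercept)" "f.L"         "f.Q"

# Same factor with ordering dropped: one dummy per non-baseline level
toy$f_unordered <- factor(toy$f, ordered = FALSE)
fit_unordered <- lm(y ~ f_unordered, data = toy)
names(coef(fit_unordered))
#> [1] "(Intercept)"     "f_unorderedmid"  "f_unorderedhigh"

# Both parameterisations give identical fitted values
all.equal(fitted(fit_ordered), fitted(fit_unordered))
```

This matches the reprex above, where the residuals, R-squared and F-statistic are identical between the two models.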

Vignette re-organization

Some vignette reorganization:

Add a vignette that has a link to all user-contributed vignettes, which are only available on dev version of package and on GitHub due to CRAN package size restrictions.
For all datasets that have a user contributed vignette, add a link to it in the corresponding help/roxygen code file.

Add SDS390 datasets to master Google Sheet

Add following datasets to Master Google Sheet. The contents of this Google Sheet ultimately get fed into this webpage listing all datasets via pkgdown's compilation of vignettes/fivethirtyeight.Rmd.

Large/numerous datasets are exceeding CRAN pkg size restrictions

Given the increase in the number and size of datasets, we're hitting CRAN package size restrictions. Our hack workaround so far has been to include only the first 10 rows of the larger datasets (see here for a list of which ones). There are many more in the latest round of data additions (see #79).

Two potential solutions are to either:

Use the drat package
Do whatever the https://github.com/ropensci/USAboundariesData package does

data format of riddler castles

Hi Albert, I went through the tidying process for the riddler castles datasets but then realized they might already be tidy? This is what I transformed it into in order to reduce having all the castle_n variables, but it makes the strategy column really redundant. Thoughts? Thanks!

Failure to find functions in Bechdel analysis vignette

When running the final two ggplot code chunks in the vignette, the following error messages are returned:

Error in scale_fill_fivethirtyeight() :
could not find function "scale_fill_fivethirtyeight"

Error in theme_fivethirtyeight() :
could not find function "theme_fivethirtyeight"
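Both missing functions are exported by the ggthemes package, so one likely cause (an assumption, since the vignette code is not shown here) is that ggthemes was not attached before the plots were built. A guarded sketch on toy data:

```r
# Assumption: the errors come from ggthemes not being attached.
# The guard keeps this runnable even where the packages are not installed.
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("ggthemes", quietly = TRUE)) {
  library(ggplot2)
  library(ggthemes)   # provides scale_fill_fivethirtyeight() and theme_fivethirtyeight()

  # Toy data, not the bechdel dataset
  toy <- data.frame(group = c("a", "b"), n = c(3, 5))
  p <- ggplot(toy, aes(x = group, y = n, fill = group)) +
    geom_col() +
    scale_fill_fivethirtyeight() +
    theme_fivethirtyeight()
}
```

If the error persists after attaching ggthemes, the installed version may predate these functions.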

Variable documentation

I'm working on the democratic primary candidates 2018 csv and I have several questions.

The README file says the variable gender exists in the dem-candidates dataset but it doesn't. Should I remove this from the README file?
The README file has descriptions for each variable already. Should I copy these descriptions for the description file I'm creating for each variable?
I'm not sure what to put for the source in the description file. A lot of the variables have different sources. For instance, some say "Supplied by Ballotpedia," some say they're from the candidate's website, some from NYTimes, etc.

Use \dontrun{} in @examples roxygen code instead of if(FALSE)

For example, in the house_district_forecast help file in R/data_starry.R, we use if(FALSE) to keep the example code from running. Change all of these to \dontrun{}.

Note from CRAN

Dear maintainer,

You have file 'fivethirtyeight/man/fivethirtyeight.Rd' with
\docType{package}, likely intended as a package overview help file, but
without the appropriate PKGNAME-package \alias as per "Documenting
packages" in R-exts.

This seems to be the consequence of the breaking change

Using @docType package no longer automatically adds a -package alias.
Instead document _PACKAGE to get all the defaults for package
documentation.

in roxygen2 7.0.0 (2019-11-12) having gone unnoticed, see
r-lib/roxygen2#1491.

As explained in the issue, to get the desired PKGNAME-package \alias
back, you should either change to the new approach and document the new
special sentinel

"_PACKAGE"

or manually add

@aliases fivethirtyeight-package

if remaining with the old approach.

Please fix in your master sources as appropriate, and submit a fixed
version of your package within the next few months.

Best,
-k
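The "new approach" mentioned in the note amounts to documenting the "_PACKAGE" sentinel in one of the package's R files. A minimal sketch follows; the title line is a placeholder, not the package's actual documentation text.

```r
#' fivethirtyeight: placeholder title for the package overview page
#'
#' @keywords internal
"_PACKAGE"
```

After re-running roxygen2, the generated .Rd file should contain the fivethirtyeight-package alias that the note asks for.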

Update to `comic_characters` vignette

Hi @jonathanbouchet

In the next release of the fivethirtyeight package, we are transitioning large datasets, including comic_characters (used in your user-submitted vignette) to the fivethirtyeightdata package, where they will be available in full instead of a preview.

Since the dataset is read-in and cleaned manually (with the code provided in the help file), the vignette doesn't need to be changed, but we would like to include a message indicating the updated location of the dataset.

Please let us know if this is okay!

Need mechanism to indicate "live" .csv vs static csv

For example soccer-spi, governors-forecast-2018, etc., so that every time we update the package, we make sure to update the data in the data/ folder by running the read_csv("URL") code in data-raw/.
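One possible mechanism is a small registry with a flag per dataset, mirroring the "DYNAMIC DATA THAT GETS UPDATED?" column mentioned in the release checklists. Everything below is a hypothetical sketch: the registry layout, the flag values, and datasets_to_refresh() are assumptions, and only the dataset names come from this issue.

```r
# Hypothetical registry; the `dynamic` flags are illustrative assumptions
dataset_registry <- data.frame(
  dataset = c("soccer-spi", "governors-forecast-2018", "bechdel"),
  dynamic = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of the datasets that must be re-read before a release
datasets_to_refresh <- function(registry) {
  registry$dataset[registry$dynamic]
}

datasets_to_refresh(dataset_registry)
#> [1] "soccer-spi"              "governors-forecast-2018"
```

A release script could loop over this vector and re-run the matching read_csv() code in data-raw/.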

[ratings dataset] Ambiguity in the "category" column description.

I think that the description of the category column in the ratings dataset might be ambiguous.

> levels(ratings$category)
 [1] "Aged 18-29"         "Aged 30-44"         "Aged 45+"           "Aged under 18"      "Females"           
 [6] "Females Aged 18-29" "Females Aged 30-44" "Females Aged 45+"   "Females under 18"   "IMDb staff"        
[11] "IMDb users"         "Males"              "Males Aged 18-29"   "Males Aged 30-44"   "Males Aged 45+"    
[16] "Males under 18"     "Non-US users"       "Top 1000 voters"    "US users"

Because there could be questions like:

Is Males under 18 a subset of all Males, and if not, how do the categories differ?
Is there any intersection between the categories?
If the combined number of respondents in 'Females Aged 18-29' + 'Females Aged 30-44' + 'Females Aged 45+' + 'Females under 18' is less than the number of respondents in the Females category, is the gap due to respondents with unknown age?

I checked an example on IMDb, but I am not sure how things sum up in the dataset.