There seems to be something wrong with the college_recent_grads dataset, but I haven't yet figured out where the issue stems from, so I can't fix it. Perhaps someone more familiar with the package can pinpoint it more quickly than I can?
Or, if I'm making a mistake somewhere, I'd appreciate a hint. Thank you!
Here are my three reprexes for comparison:
- Load the data from the package:
suppressPackageStartupMessages(library(tidyverse))
library(fivethirtyeight)
college_recent_grads %>%
  arrange(unemployment_rate) %>%
  select(major, sharewomen, unemployment_rate, sample_size, men, women) %>%
  head(5)
#> # A tibble: 5 x 6
#> major sharewomen unemployment_rate
#> <chr> <dbl> <dbl>
#> 1 Mathematics And Computer Science 0.9278072 0.000000000
#> 2 Botany 0.5289691 0.000000000
#> 3 Soil Science 0.7644265 0.000000000
#> 4 Educational Administration And Supervision 0.4487323 0.000000000
#> 5 Engineering Mechanics Physics And Science 0.1839852 0.006334343
#> # ... with 3 more variables: sample_size <int>, men <int>, women <int>
When I saw this, I didn't believe Math & CS could be about 93% women, so I looked into it a bit more.
- Load the data from the rda file in the package: I downloaded college_recent_grads.rda and ran the same code.
suppressPackageStartupMessages(library(tidyverse))
load("~/Desktop/college-grads/data/college_recent_grads.rda")
college_recent_grads %>%
  arrange(unemployment_rate) %>%
  select(major, sharewomen, unemployment_rate, sample_size, men, women) %>%
  head(5)
#> # A tibble: 5 x 6
#> major sharewomen unemployment_rate
#> <chr> <dbl> <dbl>
#> 1 Mathematics And Computer Science 0.1789819 0
#> 2 Military Technologies 0.0000000 0
#> 3 Botany 0.5289691 0
#> 4 Soil Science 0.3051095 0
#> 5 Educational Administration And Supervision 0.6517413 0
#> # ... with 3 more variables: sample_size <int>, men <int>, women <int>
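Note that the sharewomen value for Mathematics And Computer Science differs between the two outputs (0.9278072 vs. 0.1789819). Soil Science and Educational Administration And Supervision differ as well, and Military Technologies does not appear in the first output's top five at all.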
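One way to judge which sharewomen value is plausible is to recompute the share from the men and women columns. The sketch below assumes that sharewomen is meant to equal women / (men + women); check_share() is a hypothetical helper, not part of fivethirtyeight, and the toy counts are made up for illustration rather than taken from the dataset.

```r
# Flag rows whose recorded share of women disagrees with the share
# recomputed from the men and women counts.
# check_share() is a hypothetical helper, not part of fivethirtyeight.
# Assumption: sharewomen should equal women / (men + women).
check_share <- function(df, tol = 1e-6) {
  recomputed <- df$women / (df$men + df$women)
  gap <- abs(df$sharewomen - recomputed)
  df$recomputed <- recomputed
  # Rows where the share cannot be computed (NA/NaN) are flagged too.
  df$consistent <- !is.na(gap) & gap < tol
  df
}

# Toy rows with made-up counts (not taken from the dataset).
toy <- data.frame(
  major      = c("A", "B"),
  men        = c(80L, 30L),
  women      = c(20L, 70L),
  sharewomen = c(0.2, 0.9),
  stringsAsFactors = FALSE
)

checked <- check_share(toy)
print(checked[!checked$consistent, c("major", "sharewomen", "recomputed")])
```

Running check_share(college_recent_grads) on each copy of the data should show whether sharewomen still lines up with the counts in the same row.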
- Load the data from 538's repo:
suppressPackageStartupMessages(library(tidyverse))
data_from_538 <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv")
#> Parsed with column specification:
#> cols(
#> .default = col_integer(),
#> Major = col_character(),
#> Major_category = col_character(),
#> ShareWomen = col_double(),
#> Unemployment_rate = col_double()
#> )
#> See spec(...) for full column specifications.
data_from_538 %>%
  arrange(Unemployment_rate) %>%
  select(Major, ShareWomen, Unemployment_rate, Sample_size, Men, Women) %>%
  head(5)
#> # A tibble: 5 x 6
#> Major ShareWomen Unemployment_rate
#> <chr> <dbl> <dbl>
#> 1 MATHEMATICS AND COMPUTER SCIENCE 0.1789819 0
#> 2 MILITARY TECHNOLOGIES 0.0000000 0
#> 3 BOTANY 0.5289691 0
#> 4 SOIL SCIENCE 0.3051095 0
#> 5 EDUCATIONAL ADMINISTRATION AND SUPERVISION 0.6517413 0
#> # ... with 3 more variables: Sample_size <int>, Men <int>, Women <int>
This matches the .rda file, but not the data that is available after loading the package.
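To see exactly which rows disagree, the two sources can be joined on the major name and compared column by column. This is only a rough sketch: it uses the four majors that appear in both printed outputs above, and it assumes that the upper-cased major name identifies a row uniquely. The same idea should carry over to the full data frames.

```r
# Values copied from the printed outputs above.
from_package <- data.frame(
  major = c("Mathematics And Computer Science", "Botany", "Soil Science",
            "Educational Administration And Supervision"),
  sharewomen = c(0.9278072, 0.5289691, 0.7644265, 0.4487323),
  stringsAsFactors = FALSE
)
from_538 <- data.frame(
  Major = c("MATHEMATICS AND COMPUTER SCIENCE", "BOTANY", "SOIL SCIENCE",
            "EDUCATIONAL ADMINISTRATION AND SUPERVISION"),
  ShareWomen = c(0.1789819, 0.5289691, 0.3051095, 0.6517413),
  stringsAsFactors = FALSE
)

# Join on an upper-cased major name (assumed to identify a row uniquely).
from_package$key <- toupper(from_package$major)
from_538$key <- toupper(from_538$Major)
both <- merge(from_package, from_538, by = "key")

# Keep only the rows where the two sources disagree.
mismatches <- both[abs(both$sharewomen - both$ShareWomen) > 1e-6,
                   c("major", "sharewomen", "ShareWomen")]
print(mismatches)
```

On these four rows, three disagree and only Botany matches, which suggests the sharewomen values may have ended up attached to the wrong majors rather than being individually wrong.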