rfordatascience / tidytuesday

Official repo for the #tidytuesday project

License: Creative Commons Zero v1.0 Universal

R 0.47% HTML 98.93% JavaScript 0.23% CSS 0.37%

tidytuesday's Issues

Link to 2018 doesn't work

Clicking on the link to 2018 data shows a 404 error

Google Reviews for Train Stations 2019-02-26

I manually created a spreadsheet that lets you merge the Google Review score (as of 2019-02-25) by name of the departure or arrival station. Just wanted to offer to share it in case others would find it useful! I'm new to GitHub, so should you want the file, please let me know the best way to get it to you.

note: I did this by hand and can't speak French very well, so take the scores with a grain of salt

What's happened in a world last month: world news analysis

Hello! I've started a long-term project on global news analysis. What if we could collect the most popular news from different countries and continents? What if we did it every day for a… let's say one month or year?
https://medium.com/@storozhenko.dmitry/whats-happened-in-a-world-last-month-world-news-analysis-b7e540d45d64

If somebody would like to join, please let me know:)

R Package Presidential Election Polls from 1980 to 2016

Data can be found here and is part of a package. I discovered the data here - https://twitter.com/gelliottmorris/status/1089612612474732544.

Kentucky Open Ed Data

https://t.co/FcBzUxzWD7

from https://twitter.com/kristophrdelane

Buzzfeed news datasets

https://github.com/BuzzFeedNews

Add to community resources

https://github.com/vincentarelbundock/countrycode

Clarify Code posting on Twitter

The readme states

Include a copy of the code used to create your visualization when you post to Twitter.

Is there a preferred method? For example, tweet the graphic with the hash tag, then reply to your own tweet with a link? Or as a Twitter essay? Or as an attachment? If attachment, then as UTF-8 text file?

2018-12-11#data-dictionary: inspection_type is not a Date & Time field. You probably meant to put "inspection_date" here.

2018-12-11 README Data Dictionary reads:
"

inspection_type	This field represents the date of inspection; NOTE: Inspection dates of 1/1/1900 mean an establishment has not yet had an inspection	Date & Time

"
But inspection_type is not a Date & Time field. You probably meant to put "inspection_date" here. You already have inspection_type at the end of the table, i.e.

inspection_date	This field represents the date of inspection; NOTE: Inspection dates of 1/1/1900 mean an establishment has not yet had an inspection	Date & Time

November 20th Transgender Day of Remembrance data

I know it is short notice, but tomorrow is the Transgender Day of Remembrance. Forwards, Rainbow R and Cardiff RUG held a datathon at the weekend to work on a dataset of reports of killings and suicides of transgender people, who will be memorialized on TDoR. More about the datathon can be found here: https://github.com/rlgbtq/TDoR2018. The data is now available as an R package: https://github.com/CaRdiffR/tdor. Although a difficult subject, it would be great if R folk could explore this data for Tidy Tuesday and raise awareness of TDoR.

Beach Volleyball Match Data

I publish a dataset of all matches played on the AVP and FIVB professional beach volleyball circuits:

https://github.com/BigTimeStats/beach-volleyball

There is some intro code to access and start querying the data. Would be happy to write additional posts to generate some visuals.

Some ideas to work more generally with the data:

City/Country could be used to generate geographic maps and/or travel patterns from tournament to tournament or season to season
Contains both the men's and women's tours along with demographic information
Contains match stats data like aces, errors, kills, etc. that can be analyzed further

Let me know your thoughts.

Rstudio conf R learning survey data

Very interesting dataset for the R community: a survey conducted prior to RStudio conf 2019 about R users and how they are learning R.

https://github.com/rstudio/learning-r-survey

Malaria data challenge

https://www.synapse.org/#!Synapse:syn16788291/wiki/583310

Malaria data challenge opens Nov 12th

Week 17 Data

Week 17 Data Upload Temp

Analyze Public Opinion Survey on LA2028 Olympics

NoOlympics LA conducted a large survey (>1000 respondents) on public opinion regarding the Olympics. This provides real survey responses for TidyTuesday participants to analyze and visualize.

Github link with more info, context and basic analyses.
https://github.com/NOlympicsLA/Olympics-Public-Survey

Finding Story in Kaggle Machine Learning and Data Science Survey

Kaggle is having a competition to find a story in their Machine Learning and Data Science Survey results.
https://www.kaggle.com/kaggle/kaggle-survey-2018/home

Challenge ends December 3rd, but there are weekly prizes as well up until that point.

FEMA data from NPR

NPR story on who benefits most from FEMA relief. Data are here. Part of their investigation, "How Federal Disaster Money Favors The Rich"

Medium Article Scrape & Analysis

Someone did a web scrape of 1.4 million Medium articles between 8/2017-8/2018, including:

Title
Sub-title
Author
Publication Date
Tags
Read-Time
Claps-Received
Story URL
Author URL

Data:
https://www.kaggle.com/harrisonjansma/medium-stories

Article - removed as the link is no longer active.
~~https://towardsdatascience.com/i-just-published-a-massive-dataset-of-medium-stories-heres-the-link-to-get-it-889bab324138~~

Github:
https://github.com/harrisonjansma/Analyzing_Medium

Ireland litter data

This isn't linked to any existing article, but there's some interesting open data from OpenLitterMap - a crowdsourced map that captured different types of litter data.

Ireland is currently on the leaderboard with 18k verified pieces of litter.
https://openlittermap.com/en/maps

You need to log in to be able to download, but I've spoken to the dev/owner, who says the data is freely available for anyone to use. It can be accessed via direct download (4.5MB file): https://openlittermap.com/maps/Ireland/download

Repository with many data sets

https://www.figure-eight.com/data-for-everyone/

Image descriptions
Judge emotions about nuclear energy from Twitter
Decide whether two English sentences are related
Similarity judgment of word combinations
Sentiment Analysis – Global Warming/Climate Change
Judge Emotion About Brands & Products
Colors in 9 Languages
Claritin Twitter
Sentence plausibility
Academy Awards demographics
Agreement between long and short sentences
Company categorizations (with URLs)

Thanks for setting up Tidy Tuesday in the first place!!

Wrong URL for obs_gender

Just wanted to let you know that the URL used for the obs_gender variable is incorrect. I believe it should point to the jobs_gender.csv file instead of the gender_earnings.csv file.

European Social Survey and Weather Canada survey

Hi @jthomasmock it was great meeting you today. As I mentioned, a couple of cool data resources include the European Social Survey which has an API package available in CRAN (essurvey). Another one is weather data from Environment Canada, which is available via weathercan.

Another one I didn't mention, but I think you may find a pretty fun challenge is bikesharing data, which is made easy to get using the bikedata package.

Anime Dataset?

Hi, I find this anime list dataset fascinating, and I wonder if you could do a webcast on it. The link to the dataset is here: .

It shows the ranking of anime based on different criteria.

Twitter Elections Integrity Datasets

https://about.twitter.com/en_us/values/elections-integrity.html#data
"Twitter is making publicly available archives of Tweets and media that we believe resulted from potentially state-backed information operations on our service."

Seattle Pet Names

https://t.co/PdfzaETLj2

https://twitter.com/skyetetra/status/1093737135847309312

Supreme Court Confirmation Hearing Transcripts

https://www.rstreet.org/2019/04/04/supreme-court-confirmation-hearing-transcripts-as-data/

OECD Data

The Organisation for Economic Co-operation and Development (OECD) has an enormous amount of data on a variety of topics. You could probably do an entire year of tidytuesdays on their data alone.

https://data.oecd.org/

The only issue is that the data comes in fairly clean, so less of a learning opportunity on the tidying side.

Freddie Mac data

https://twitter.com/jkregenstein/status/1088870894536134656

http://www.freddiemac.com/research/indices/house-price-index.html

France wind turbines

https://opendata-renewables.engie.com/explore/dataset/la-haute-borne-data-2013-2016/

Student Diversity

https://www.chronicle.com/interactives/student-diversity-2018
"race, ethnicity, and gender of students at 4,342 colleges and universities in the fall of 2016"

I'm pretty sure the data comes from the data table 12-Month Enrollment (12-month unduplicated headcount: 2016-17) found at https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx
*choose 2017 (this is where the 2016 school year data lives), then all surveys

To get the institution names merge to the Institutional Characteristics (Directory information) table.
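
The merge described above can be sketched in base R. This is only an illustration: the rows are made up, and the column names `UNITID` (institution identifier) and `INSTNM` (institution name) are assumptions about the downloaded files, as is the hypothetical `headcount` column.

```r
# Hypothetical enrollment rows; `headcount` is a placeholder column name.
enrollment <- data.frame(
  UNITID    = c(1, 2, 3),
  headcount = c(100, 250, 75)
)

# Hypothetical directory rows; UNITID and INSTNM are assumed column names.
directory <- data.frame(
  UNITID = c(1, 2),
  INSTNM = c("Example College", "Sample University"),
  stringsAsFactors = FALSE
)

# Left join: keep every enrollment row, attach the name where one exists.
named <- merge(enrollment, directory, by = "UNITID", all.x = TRUE)
print(named)
```

Using `all.x = TRUE` keeps institutions that have no directory entry, so missing names show up as `NA` instead of silently dropping rows.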

Proposal for week of 4/20: is the cannabis "holiday" related to car crashes?

A few months ago, Harper and Palayew[1] published a study examining whether a signal tied to the "4/20" holiday could be detected in fatal car crashes in the United States. It followed up on a previous study by Staples and Redelmeier[2] that suggested a strong link. Using more robust methods and a more comprehensive time window, Harper and Palayew could not find a signal for 4/20, but could for other holidays, such as July 4.

This is a great example of how charts can mislead based on choices in analysis and plotting.

Some of Harper and Palayew's analysis was done in R, but more was done in Stata and Stan. Their manuscript and their original data/code are at https://osf.io/qnrg6/. I built a script that downloads their original raw data and tidies it into the datasets they used in their paper, as a possible #tidytuesday activity using R. Other datasets can be created from the raw data, and additional tidying possibilities exist, of course.

The entire script, which includes a couple of starter plots, is at https://github.com/Rmadillo/Harper_and_Palayew/blob/master/Load_Data_and_Clean.R, but you can download and tidy the data from this:

#### Load packages -------------------------------------------------------------

library(haven)
library(tidyverse)
library(lubridate)

#### Acquire raw data ----------------------------------------------------------

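# Crash data (from Harper and Palayew)
# Create the target folder first; download.file() fails if it does not exist
dir.create("~/Downloads/farsp", recursive = TRUE, showWarnings = FALSE)
download.file("https://osf.io/kj7ub/download", "~/Downloads/farsp/farsp.zip", mode = "wb")
unzip("~/Downloads/farsp/farsp.zip", exdir = "~/Downloads/farsp")

# pattern is a regular expression, not a glob
dta_files = list.files(path = "~/Downloads/farsp", pattern = "\\.dta$", full.names = TRUE)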
dta_files = setNames(dta_files, dta_files)

fars = map_df(dta_files, read_dta, .id = "id") 

# Geographic lookup
geog = read_csv("https://www2.census.gov/geo/docs/reference/codes/files/national_county.txt",
                col_names = c("state_name", "state_code", "county_code", 
                              "county_name", "FIPS_class_code")) %>%
    mutate(state = as.numeric(state_code),
           county = as.numeric(county_code),
           FIPS = paste0(state_code, county_code))

#### Data wrangling ------------------------------------------------------------
# Used https://osf.io/drbge/ Stata code as a guide for cleaning

# All data
# This might take awhile... go get a coffee
all_accidents = fars %>%
    # What are state and county codes/look ups?
    select(id, state, county, month, day, hour, minute, st_case, per_no, veh_no,
           per_typ, age, sex, inj_sev, death_da, death_mo, death_yr, 
           death_hr, mod_year, death_mn, death_tm, lag_hrs, lag_mins) %>%
    # CAPS used to avoid conflict with lubridate
    rename(MONTH = month, DAY = day, HOUR = hour, MINUTE = minute) %>%
    mutate_at(vars(MONTH, DAY, HOUR, MINUTE), na_if, 99) %>%
    mutate(crashtime = HOUR * 100 + MINUTE,
           YEAR = as.numeric(gsub("\\D", "", id)) - 10000,
           DATE = as.Date(paste(YEAR, MONTH, DAY, sep = "-")),
           TIME = paste(HOUR, MINUTE, sep = ":"),
           TIMESTAMP = as.POSIXct(paste(DATE, TIME), format = "%Y-%m-%d %H:%M"), 
           e420 = case_when(
               MONTH == 4 & DAY == 20 & crashtime >= 1620 & crashtime <= 2359 ~ 1,
               TRUE ~ 0),
           e420_control = case_when(
               MONTH == 4 & (DAY == 20 | DAY == 27) & crashtime >= 1620 & crashtime < 2359 ~ 1,
               TRUE ~ 0),
           d420 = case_when(
               crashtime >= 1620 & crashtime <= 2359 ~ 1,
               TRUE ~ 0),
           sex = factor(case_when(
               sex == 2 ~ "F",
               sex == 1 ~ "M",
               sex >= 8 ~ NA_character_,
               TRUE ~ NA_character_)),
           Period = factor(case_when(
               YEAR < 2004  ~ "Remote (1992-2003)",
               YEAR >= 2004 ~ "Recent (2004-2016)",
               TRUE ~ NA_character_)),
           age_group = factor(case_when(
               age <= 20 ~ "<20y",
               age <= 30 ~ "21-30y",
               age <= 40 ~ "31-40y",
               age <= 50 ~ "41-50y",
               age <= 97 ~ "51-97y",
               age == 98 | age == 99 | age == 998 ~ NA_character_,
               is.na(age) ~ NA_character_,
               TRUE ~ NA_character_))
           ) %>%
        filter(per_typ == 1, 
           !is.na(MONTH),
           !is.na(DAY))

# Daily+Time Group
# This should match 420-data.dta observations at https://osf.io/ejz28/ 
# Verify: dta_orig = read_dta("https://osf.io/ejz28/download")
# arsenal::compare(daily_accidents_time_groups, dta_orig)
daily_accidents_time_groups = all_accidents %>%
    group_by(DATE, d420) %>%
    summarize(fatalities_count = n())

# Daily+Time Group final working data
# Only use data starting in 1992
daily_accidents_time_groups = all_accidents %>%
    filter(YEAR > 1991) %>%
    group_by(DATE, d420) %>%
    summarize(fatalities_count = n())

# Daily final working data
daily_accidents = all_accidents %>%
    filter(YEAR > 1991) %>%
    group_by(DATE) %>%
    summarize(fatalities_count = n())
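
To make the time-window flags in the script easier to follow, here is a minimal base R sketch on a made-up four-row sample. The column names mirror the script above; the rows themselves are hypothetical and exist only to show how `crashtime`, `e420`, and `d420` relate.

```r
# Hypothetical mini-sample of crash records (not real FARS data).
crashes <- data.frame(
  MONTH  = c(4, 4, 4, 7),
  DAY    = c(20, 20, 27, 4),
  HOUR   = c(16, 16, 18, 17),
  MINUTE = c(19, 20, 0, 30)
)

# Encode the time of day as HHMM, e.g. 16:20 becomes 1620.
crashes$crashtime <- crashes$HOUR * 100 + crashes$MINUTE

# TRUE when the crash falls in the evening window (16:20 to 23:59).
in_window <- crashes$crashtime >= 1620 & crashes$crashtime <= 2359

# e420: in the window on April 20. d420: in the window on any day.
crashes$e420 <- as.integer(crashes$MONTH == 4 & crashes$DAY == 20 & in_window)
crashes$d420 <- as.integer(in_window)

print(crashes)
```

The first row (16:19 on April 20) falls one minute before the window, so both flags are 0. Only the second row is flagged by `e420`.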

For this dataset, it's especially important to remember the following caveats from the main #tidytuesday page:

"We will have many sources of data and want to emphasize that no causation is implied. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our guidelines are to use the data provided to practice your data tidying and plotting techniques. Participants are invited to consider for themselves what nuancing factors might underlie these relationships.

The intent of Tidy Tuesday is to provide a safe and supportive forum for individuals to practice their wrangling and data visualization skills independent of drawing conclusions. While we understand that the two are related, the focus of this practice is purely on building skills with real-world data."

[1]. Harper S, Palayew A. The annual cannabis holiday and fatal traffic crashes. BMJ Injury Prevention. Published Online First: 29 January 2019. doi: 10.1136/injuryprev-2018-043068. Manuscript and original data/code at https://osf.io/qnrg6/

[2]. Staples JA, Redelmeier DA. The April 20 cannabis celebration and fatal traffic crashes in the United States. JAMA Intern Med. 2018 Feb;178(4):569–72.

Hypoxia data set

I am a glider pilot, and when gliders go above 14,000 feet, the pilot is required to have a supplemental oxygen source.

The magazine of the Soaring Society of America (SSA), Soaring, recently published an article about the lack of oxygen and/or carbon dioxide during flight, and the table caught my eye.

I received permission from the author and editor to post the article and "crowdsource" different means of presenting the data, which could include alternative tabular representations or other more visual means.

Here is the article in full. Hypoxia Article proof.pdf

I've transcribed the Table 1 data into a comma-separated text file, since the table is an image in the article. table1.txt

The author, the editor, and I are very interested in the products of everyone's imaginations! SSA is a non-profit, the author was not paid for his work, and the table originated from Guyton & Hall: Textbook of Medical Physiology, 12th ed. Attribution is all that is requested.

Johnson, D. (2018, August). Hypoxia, Hyperventilation, and Supplemental Oxygen Systems. Soaring, 19-27.

Hall, J. E., & Guyton, A. C. (2011). Guyton and Hall Textbook of Medical Physiology, 12e. Philadelphia: Elsevier Saunders.

WHO Tuberculosis Data

Hi there,

The WHO releases tuberculosis (TB) data, which I have wrapped in an R package. It might be a good fit for tidytuesday as the data itself is pretty interesting (i.e., it covers a full range of countries with a good level of detail on the epidemiology of TB) and there are quite a few angles that can be explored.

The tooling I have built up in the package is meant to facilitate first passes at visualisations and could potentially help newer R users get going with their own more novel visualisations. A general intro to the package is here (also links out to several case studies and gists containing additional visualisations).

No worries if not of interest!

Data Viz Resources

Color Palettes

https://blog.datawrapper.de/colors/
https://blog.datawrapper.de/colorguide/
https://github.com/EmilHvitfeldt/r-color-palettes
http://www.sthda.com/english/wiki/colors-in-r

more CA fire damage data available

I've got more historical data for fires that occurred in CAL FIRE's jurisdiction, including NUMBER OF FIRES, ACRES BURNED and DOLLAR DAMAGE from 1933 to 2016.
Repo here
Data file available here

Caselaw Access Project - Bulk Download or API

The Caselaw Access Project (CAP) has digitized over 40 million pages of US court decisions. What better way to promote civic engagement than through caselaw?!?

Use the bulk download (currently only IL and AR) without login. Other states with login and user agreement.
Use API for up to 500 cases per user per day.

Service: https://case.law/
Bulk: https://case.law/bulk/
API: https://case.law/api/
Announcement: https://lil.law.harvard.edu/blog/2018/10/29/caselaw-access-project-cap-launches-api-and-bulk-data-service/

Open data about the Nobel Laureates and the Nobel Prize awarded achievements

From a Swedish page (https://oppnadata.se/datamangder/#esc_term=nobel), the dataset is in English with these variables: year, category, overallMotivation, id, firstname, surname, motivation, share

http://api.nobelprize.org/v1/prize.csv
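
A first pass at this file could look like the sketch below. It uses base R and a few placeholder rows laid out with the variables listed above. The rows are invented for illustration, and whether the real file parses cleanly with default `read.csv()` settings is an assumption.

```r
# Placeholder rows in the column layout described above (not real laureates).
# To try the real file, replace `text = sample_csv` with the URL from the issue.
sample_csv <- "year,category,overallMotivation,id,firstname,surname,motivation,share
2001,physics,,1,Ann,Example,placeholder,2
2001,physics,,2,Bob,Sample,placeholder,2
2001,peace,,3,Cy,Demo,placeholder,1
2002,physics,,4,Dee,Test,placeholder,1"

prizes <- read.csv(text = sample_csv, stringsAsFactors = FALSE)

# Number of laureate rows per category.
per_category <- table(prizes$category)
print(per_category)

# Number of laureate rows per year and category, as a data frame.
per_year_category <- aggregate(id ~ year + category, data = prizes, FUN = length)
names(per_year_category)[3] <- "laureates"
print(per_year_category)
```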

Data on French High Speed Train Delays

I found data on high-speed trains on the open data site of the French train company (SNCF).
They have several datasets:

between stations : https://frama.link/2Puw3PM1
national summaries: https://frama.link/3cVHUhZL

The data on the stations (including the zipcode and the GPS coordinates) are in another file: https://frama.link/Qh7-jkJ7 . I am afraid I have not been able to join both dataframes; my fuzzy joining and general data wrangling skills are not good enough yet, but maybe some people will be up to the challenge.
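
One possible starting point for the fuzzy join, sketched in base R with `adist()` (generalized Levenshtein distance). The station spellings, the `station_id` column, and both data frames are hypothetical; the real files may differ enough that a different distance or some manual cleanup is needed.

```r
# Hypothetical names as they might appear in the delay data.
delays <- data.frame(
  station = c("PARIS LYON", "LILLE", "MARSEILLE ST CHARLES"),
  stringsAsFactors = FALSE
)

# Hypothetical names as they might appear in the station file.
stations <- data.frame(
  name       = c("Lille Europe", "Marseille Saint-Charles", "Paris Gare de Lyon"),
  station_id = c("S1", "S2", "S3"),
  stringsAsFactors = FALSE
)

# Edit distance between every delay name (rows) and station name (columns).
d <- adist(delays$station, stations$name, ignore.case = TRUE)

# For each delay name, pick the station with the smallest distance.
best <- apply(d, 1, which.min)

matched <- data.frame(
  delays,
  matched_name = stations$name[best],
  station_id   = stations$station_id[best],
  distance     = d[cbind(seq_along(best), best)],
  stringsAsFactors = FALSE
)
print(matched)
```

Keeping the `distance` column makes it easy to review the weakest matches by hand, since the closest name is not always the right one.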

data: climate model versus climate history

Just an idea:

Statistical comparison of climate model data versus paleoclimatological history data, both available through:

https://www.ncdc.noaa.gov/data-access

Brian

Fed R&D Budget

https://twitter.com/costasamaras/status/1093292162182193154?s=21

https://www.aaas.org/programs/r-d-budget-and-policy/federal-rd-budget-dashboard

Instacart Orders

I ran across this dataset on Medium. It is a sample dataset of grocery orders from Instacart, the Uber Eats of grocery delivery. Might be good for both data visualization as well as ML techniques.

Link to article: https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2
Link to dataset: https://www.instacart.com/datasets/grocery-shopping-2017

@jthomasmock