nytimes / covid-19-data


A repository of data on coronavirus cases and deaths in the U.S.

Home Page: https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html

License: Other

covid-19


covid-19-data's Issues

FIPS needs definition

Request that README include a definition, or link to a definition, of the FIPS code and what it means in this context.

Is fine-grained/individual data being collected, too? Will it be ever shared?

Hi,

The datasets released on this repo (like those of Johns Hopkins University, at country level) are coarse datasets (total counts of infected/dead/recovered). Such coarseness severely constrains statistical modeling beyond simple¹ descriptive analysis.

By individual/fine-grained data I mean status updates of individuals from testing positive until the case is closed (i.e., by recovery or death), and some background descriptions (age, sex, pre-existing conditions, etc.). A similar dataset has been collected by Xu and colleagues (Nature Scientific Data, DOI:10.1038/s41597-020-0448-0) and is online on @beoutbreakprepared's repository.

I can imagine @nytimes would have legal concerns (e.g., regarding HIPAA laws, as @CBG-63 has mentioned in #11). I'm no lawyer, but I hope some sort of anonymization might waive that.

¹ Actually, in case you are interested in modeling with coarse data, take a look at the coarseDataTools R package. It has facilities to estimate, e.g., the case fatality rate (Reich et al. 2012; DOI:10.1111/j.1541-0420.2011.01709.x).

REST API for CSV

Thanks nytimes for gathering the data ❤️

I've created a simple REST API which loads the raw CSV files directly from this repo. The usage guide, hosted public API links and code are available here: https://github.com/desholmes/covid-19-us-api.

I've also been helping out with the automation of tomwhite's UK data gathering efforts, let me know if I can help nytimes out too.

Thank you! GIF that animates daily changes in US

Twitter daily updates: https://twitter.com/hashtag/covid19in20s

Other data available?

Looking at the situation in Europe, and already in New York, perhaps you should consider hospitalization and ICU counts (and how they evolve) to be the main and most important KPIs to analyze in the coming days.

NYC fips value is null

This may be intentional or an artifact of how the data is collected, but why does NYC have a null fips value?

I've brought this dataset, along with other covid datasets, into BigQuery for comparative analysis. Here's the query, and I've verified the raw CSV file is also missing a value for NYC.

This is the resulting visualization (using Tableau):

Last Update Datetime inclusion in README

So that users can quickly see the timestamp of the latest data pushed without opening a CSV file (assuming the repo was not cloned), can the header of the README display the date and time of the last data push? Travis CI should be able to do this.

Jupyter notebook to get people started?

You may want to consider including a sample Jupyter notebook in your repo to help people get started with this data even more easily. Best of luck and let me know if I can help in any way!

-Tim Novikoff
PM for Colab

Data should drill down to the zipcode of the infected, not just county.

County level is nice, but it would be far more useful to be able to drill down to zipcode level. Many counties, like mine (Hennepin County, MN), are HUGE and have a WIDE RANGE of density.

data on number of tests?

There are some data sources out there on the number of tests by state/county. Do you have any plans to try to organize some of this data too?

Add Pending Test Column

Please add a column for pending tests.

Data inconsistency: Your data is 20% less than the posted JHU/WHO data

The death counts in your us-counties data are over 20% lower than the JHU/WHO data at https://github.com/CSSEGISandData/COVID-19. Why such a large discrepancy?

Did you consider using an existing license when releasing this dataset?

First & foremost, thank you so much for publishing this dataset. Your commitment to opening this data in a time of need is so helpful. I can imagine there was more than a little red tape involved in getting this data clear of your legal department & out in the open.

Given that this dataset comes with a custom license, clearly some thought was put into the subject. With that in mind, would you care to comment on why you chose to go with a custom license, and not an existing open data license?

Thanks again, now I'm off to go purchase a subscription to the Times.

County geometry information?

FIPS are included in data sets, but wondered about the best approach for joining this dataset with geometry information about each county so we can visualize this data over time with Kepler.gl? I am not a data scientist, and so have no experience with Python, so was hoping to be able to work with the data in Kepler without coding.

https://github.com/daniel-lij/keplerTest/blob/master/counties.json has county geometry data as well as FIPS embedded for each county.

Can someone with more knowledge than me assist in joining this NYT county csv with the json so that we can somehow view the change in data over time in Kepler? I read that csv with geometry data can take WKT, but unsure how to implement into csv so data can easily be read by Kepler. Is it possible to still have a csv file for future inclusion of data that also includes geometry data, or do they have to separately be imported into Kepler?

I have made little progress moving between JSON and CSV. Throwing GeoJSON and other formats into the mix only makes it harder to get working.

Found an article from Medium that can join unemployment data with geometry data, but my limited Python cannot fully grasp it: https://medium.com/vis-gl/visualizing-u-s-county-unemployment-with-kepler-gl-c5f2ed31c71
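For anyone attempting the join described above, here is a minimal sketch of one way it could be done: index the county geometries by FIPS, then attach a geometry to each CSV row and write everything out as a single GeoJSON FeatureCollection, a format Kepler.gl can load. The layout of counties.json is not shown in this issue, so the "fips" property name and the placeholder triangle geometry below are assumptions; the two CSV rows are taken from figures quoted elsewhere on this page.

```python
import csv
import io
import json

# Two rows in the us-counties.csv layout (date,county,state,fips,cases,deaths).
COUNTIES_CSV = """date,county,state,fips,cases,deaths
2020-03-25,Snohomish,Washington,53061,633,15
2020-03-25,Kittitas,Washington,53037,6,0
"""

# Stand-in for a county geometry file. The "fips" property name and the
# triangle coordinates are assumptions made for illustration only.
COUNTIES_GEOJSON = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"fips": "53061"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]],
            },
        },
    ],
}


def join_cases_to_geometry(csv_text, geojson, fips_key="fips"):
    """Build a FeatureCollection with one feature per CSV row that has a geometry."""
    geometry_by_fips = {
        feature["properties"][fips_key]: feature["geometry"]
        for feature in geojson["features"]
    }
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        geometry = geometry_by_fips.get(row["fips"])
        if geometry is None:
            continue  # blank FIPS, or no geometry available for this county
        features.append({
            "type": "Feature",
            "geometry": geometry,
            "properties": {
                "date": row["date"],
                "county": row["county"],
                "state": row["state"],
                "fips": row["fips"],
                "cases": int(row["cases"]),
                "deaths": int(row["deaths"]),
            },
        })
    return {"type": "FeatureCollection", "features": features}


joined = join_cases_to_geometry(COUNTIES_CSV, COUNTIES_GEOJSON)
print(json.dumps(joined)[:120])
```

FIPS values are kept as strings on both sides so that codes with a leading zero still match. Rows without a matching geometry are skipped rather than guessed at.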

Request: Provide date and time the data was collected

Cross-validating the data against JHU/WHO shows discrepancies, which might be attributed to differences in the time when the data was collected from your sources. Can you provide the exact times when the data for a given county is collected?

For example, JHU for New York City on 3/25 shows 17857 cases and 199 deaths, and your data shows 20011 cases and 280 deaths.

Data marked March 27 should be March 26

Right now it is early on March 27, but these .csv files have all the data for the 27th. That doesn't make sense, because the day hasn't happened yet. The files are also almost totally missing data from March 26. It looks like the data marked for March 27 should have actually been for March 26. I think that also explains why this chart looks like it is missing a day's worth of data: https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html#g-cases-over-time

That chart is missing data from yesterday, but does have data for today, which hasn't really happened yet.

Can you confirm this data is correct?

Open Source Helps!

Thanks for your work to help the people in need! Your site has been added! I currently maintain the Open-Source-COVID-19 page, which collects all open source projects related to COVID-19, including maps, data, news, api, analysis, medical and supply information, etc. Please share to anyone who might need the information in the list, or will possibly contribute to some of those projects. You are also welcome to recommend more projects.

http://open-source-covid-19.weileizeng.com/

Cheers!

Add data provenance

Seeing as the CSV files stored here are going into high-profile maps, Medium blogs, Twitter, etc., and in turn affecting critical policy decisions, can you try to add some provenance information? That is, for a given case count in the CSV, where did the data come from? There will almost certainly be discrepancies and incorrect reporting; better to not take it upon yourselves as a newspaper to resolve them. Additionally, reporting the data source will make clear to the public the holes in statistics given at the state and local level.

Keep up the good work!

Add folder for historical snapshots of CSVs

A huge subset of the people interested in working with this data will likely want to look at how cases develop over time in different locales. Right now, the only way to do this with the repository structured as it is would be to look at the git history, plucking out the changes at each commit. This would be overly cumbersome for everyone but the most technically sophisticated folks. Please add a folder in which you store copies of the files with an ISO datetime appended to each filename, so that people can quickly rip through the folder to generate dynamic representations of the data.

TN Data is very off

TN Data found here is quite different from your reporting. For example, Williamson county has over 90 cases, but according to the us-counties.csv file there are only 35 cases.

Planned update frequency in README

Hello! Thank you for providing this.

Would you be able to provide some guidance in the README about when (e.g. 10a and 4p) and/or how often (e.g. 3x per day) you expect to update the data?

Excluded Counties

You may not be aware, but the boroughs of New York City are also counties: Brooklyn = Kings, Queens = Queens, Staten Island = Richmond, Bronx = Bronx, Manhattan = New York (obviously problematic naming). The population of Kings county is larger than at least 4 western states. Please consider adding them to your account. The daily data can be found here: https://www1.nyc.gov/assets/doh/downloads/pdf/imm/covid-19-daily-data-summary.pdf.

I live in Brooklyn and I have personally been tracking NY state county numbers here: https://docs.google.com/spreadsheets/d/159YnHmWplkEkhUZbC5oFT9cKiibLVMzaRm4LbhHN4Co/edit?usp=sharing

Thanks for your efforts,
Cooper Miller

Is the data on sex and age available anywhere?

COVID[data]

Is the Dataset used in this NYT Video Available?

YouTube Video Link - How China Is Reshaping the Coronavirus Narrative

I'm working on building a model around classification of information that is spread around the internet about covid-19, as a growing number of actors are using the topic to further their personal agendas. The dataset and method for labeling the data would be extremely helpful.

Great reporting on the video!

CSV vs. Endpoint

Hello! Thank you so much for making this resource available; getting this data in the hands of the public is super important so that we can all collectively contribute in our own ways.

I was wondering if it's possible to get an HTTP endpoint for this data for use in live applications. Right now, I believe the GitHub Contents REST API v3 gives us the ability to fetch file information from a repo, but it would be significantly more convenient if there was an API to which we could pass individual parameters (i.e. county name, state name) and get specific, filtered data (with a last updated timestamp).

Thank you!
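Until such an endpoint exists, the filtering described above can be approximated client-side. The following is only a sketch: it parses CSV text in the us-counties.csv layout and filters by state and/or county name. The raw-file URL is an assumption about where the file is served and is not fetched here, so the example runs offline on three rows quoted elsewhere on this page; the helper name filter_rows is hypothetical.

```python
import csv
import io

# Assumed raw-file location; not fetched here so the sketch runs offline.
# For live data one could use:
#   urllib.request.urlopen(RAW_URL).read().decode("utf-8")
RAW_URL = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"

SAMPLE = """date,county,state,fips,cases,deaths
2020-03-24,Rockland,New York,36087,671,3
2020-03-25,Snohomish,Washington,53061,633,15
2020-03-25,Grant,Washington,53025,26,0
"""


def filter_rows(csv_text, state=None, county=None):
    """Return rows matching the given state and/or county (case-insensitive)."""
    matches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if state is not None and row["state"].lower() != state.lower():
            continue
        if county is not None and row["county"].lower() != county.lower():
            continue
        matches.append(row)
    return matches


washington = filter_rows(SAMPLE, state="Washington")
print([(row["county"], row["cases"]) for row in washington])
```

A thin HTTP wrapper around a function like this is roughly what a parameterized endpoint would provide; a last-updated timestamp would still have to come from somewhere else, such as the repository's commit history.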

Thank you - FIPS to Lat/Long Translation JSON

Hi, thank you for creating a great dataset for us to use. This is not an issue but rather something to help others: I've opened up a new repo for anyone that needs FIPS to general latitude/longitude mappings as a JSON, which is available here. For choropleth maps, this won't be sufficient and one would likely need to use a shapefile, but this is for those of us plotting points on a map directly.

Include the stay at home/safer at home data

There are separate NYTimes datasets supporting a graphic/map about where people have been asked to stay home, can you tie that into this dataset at the state and/or county level?

https://www.nytimes.com/interactive/2020/us/coronavirus-stay-at-home-order.html

Add hospitalizations

Confirmed case counts are skewed by how many tests get run, and who gets tested, both of which differs wildly across the country.

The number of people in a place hospitalized for COVID is a steady estimator for the total number of active infections in that place that have progressed to severe symptoms, which is expected to be a steady fraction of total infections. This is very useful for finer-grained forecasting purposes, as is being done using https://github.com/CodeForPhilly/chime/

Under normal circumstances, we would expect the CDC to gather, collate, and release this kind of information, and even make the forecasts using an in-house team and distribute them. Instead, we see individual hospitals scrambling to pull something together themselves.

Please try to collect this data where possible, and use your connections and clout to get it more consistently released.

Data contradicts other data sources?

Why is this data so incredibly skewed from other data sources such as https://www.worldometers.info/coronavirus/country/us/ or https://coronavirus.jhu.edu/map.html ? I'm not so certain of the data's correctness. If this data is to be believed, it's showing that there are 375007 cases in the United States, which directly contradicts the data sources listed above. Can we get confirmation on the accuracy of this data or at least a list of sources other than "journalists"?

Deaths are per day, right?

I saw in the readme there was cumulative, but it has to be per day, right?
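If the counts are cumulative, as the README wording mentioned above suggests, per-day figures can be recovered by differencing consecutive days. A minimal sketch, using made-up numbers rather than real data:

```python
def daily_from_cumulative(cumulative):
    """Convert a chronologically ordered list of running totals to per-day counts."""
    daily = []
    previous = 0
    for total in cumulative:
        daily.append(total - previous)
        previous = total
    return daily


# Illustrative running totals for five consecutive days (not real data).
cumulative_deaths = [0, 1, 3, 3, 7]
print(daily_from_cumulative(cumulative_deaths))
```

The first element is treated as the count for the first day on record, which assumes the series starts from zero before that day.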

Direct Data from County Officials?

From your list of sources, would you want direct contributions FROM local County EMA's? As a member of mine, ours brought it up... you'd need some sort of validation.

For example, our County, Camden County, GA is listed as 1 case, whereas we do have 6

Add Recovered Stat- and Thank you NYT

Consider adding column showing number of recovered individuals. Thank you for sharing this info with the world.

Interactive Mapping

This is less of an issue but for users:

Here is a repository with a jupyter notebook which will map and then create a Bokeh interactive map:

This is still work in progress, will make enhancements in next day or so. Thank you @nytimes for the great work.

Update frequency?

First of all, thanks for providing this data! The covid dataverse is getting kind of crowded, and I hope the various efforts will begin to consolidate their efforts and collaborate. That said, publishing is always better than not publishing. Unless maybe you bait and switch like JHU, constantly changing the format with little warning... but I digress.

What's the expected frequency and timing of updates to this dataset?

(This would be a good thing to include in an FAQ, per #38 )

Daily count

Could it be possible to show the confirmed cases and deaths (new or cumulative) by date?

... like the Johns Hopkins data. For example:

https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv

New York City: by County

Hi - first off, thank you for this source. Is there a breakout by Queens, Kings, Bronx, Staten Island, New York?

FAQ

request some kind of FAQ including "how may others contribute to this data set?".

who is editing this? what are their qualifications. etc.

enable wiki.

Data inconsistency: Data reported here is not being echo'd by NYT own reported data in news articles

Why is this data not being used in all NYT reporting on the coronavirus? For example: https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html

Data inconsistency: Cumulative data between us-counties.csv and us-states.csv are not equal

If you sum the deaths and cases between each of your sets, they aren't equal. Why?
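To make a report like this reproducible, the comparison can be sketched as follows: sum the county rows per date and state, then subtract that from the matching state row. The county rows are figures quoted elsewhere on this page; the state row is a made-up value for illustration, and the us-states.csv column layout (date,state,fips,cases,deaths) is assumed.

```python
import csv
import io
from collections import defaultdict

COUNTIES = """date,county,state,fips,cases,deaths
2020-03-25,Benton,Washington,53005,10,2
2020-03-25,Grant,Washington,53025,26,0
2020-03-25,Snohomish,Washington,53061,633,15
"""

# Illustrative state-level row (made-up totals, not real data).
STATES = """date,state,fips,cases,deaths
2020-03-25,Washington,53,700,17
"""


def county_totals(csv_text):
    """Sum cases and deaths over counties for each (date, state)."""
    totals = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["date"], row["state"])
        totals[key][0] += int(row["cases"])
        totals[key][1] += int(row["deaths"])
    return totals


def state_minus_counties(counties_text, states_text):
    """Return {(date, state): (case_gap, death_gap)} for every state row."""
    totals = county_totals(counties_text)
    gaps = {}
    for row in csv.DictReader(io.StringIO(states_text)):
        key = (row["date"], row["state"])
        county_cases, county_deaths = totals.get(key, (0, 0))
        gaps[key] = (int(row["cases"]) - county_cases,
                     int(row["deaths"]) - county_deaths)
    return gaps


print(state_minus_counties(COUNTIES, STATES))
```

A non-zero gap only shows where the two files disagree; it does not say which file is right or why they differ.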

Maryland data does not fit with official state data

According to the NYT file, there are 581 cumulative confirmed cases as of 3/26/20, but the official MD data here is reporting 992! The smaller number is also being reported in the JHU files, but it worries me about underlying errors in JHU's files. Is NYT using other sources?

Possible to find out workplace?

Hi, Did you happen to ask people where they work, so it might be possible to track that way?

Doña Ana, NM missing FIPS

Hi,
First of all, thank you for making this amazing data set available.
The FIPS code for Doña Ana, NM is 35013, but the FIPS fields are missing in the database whenever Doña Ana is listed (probably because of some ASCII merging error).
Thanks!
Keyon

FIPS code for Unknown county should be 000, not 999

I noticed that some of the data where the county is listed as "Unknown" has a FIPS code ending in 999. I think those should end in 000. For example, there are two records for Maine, Unknown with a FIPS code of 23999. If you look in the Census Bureau's FIPS spreadsheet, the state of Maine by itself has a state code of 23 and a county code of 000, or 23000. In other words, I think FIPS uses 0 to indicate NULL/Unknown, for each sub-level rather than 9.

Maybe this was a deliberate choice so you didn't wind up with FIPS matches for these records, but in that case, maybe leaving the FIPS column blank would make more sense than populating it with a value that doesn't exist in the FIPS reference data.
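The state-plus-county structure described above can be sketched in a few lines. This assumes five-digit county FIPS strings (two digits for the state, three for the county); the helper names are hypothetical, and treating both 000 and 999 as placeholder county codes is a choice made for this example rather than a documented rule.

```python
def split_fips(fips):
    """Split a 5-digit county FIPS string into (state_code, county_code)."""
    if len(fips) != 5 or not fips.isdigit():
        raise ValueError(f"expected a 5-digit FIPS string, got {fips!r}")
    return fips[:2], fips[2:]


def is_placeholder_county(fips, placeholder_codes=("000", "999")):
    """True if the county part is a placeholder rather than a real county code."""
    _, county_code = split_fips(fips)
    return county_code in placeholder_codes


for code in ["23999", "23000", "35013", "08117"]:
    print(code, split_fips(code), is_placeholder_county(code))
```

Keeping FIPS values as strings matters: a code such as 08117 loses its leading zero if it is parsed as an integer, and then no longer matches reference data.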

Shift to Wikipedia-compatible license

This is important, thank you for posting it.
(Do you have related line list data as well?)

Echoing #10 but more specifically: please change the license to a CC license that would let this data be integrated more easily into Wikipedia. WP already has crowdsourced versions of most of this data, with sources, but both projects would benefit from a more thorough synchronization.

The ideal license for data is CC-0, which matches the license of government data (including that in your sources.) That would let this dataset be part of the public knowledge graph (Wikidata), where every revision of each data point can have one or more sources -- important for tracking some of the important nuances here. Please make this happen!

so what?

why are you comparing counties? the population is different and smaller counties may gain false confidence due to low number of cases when it's simply because there's fewer people.

a right and fair comparison would be #of cases per 1,000 people.
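The normalization suggested above is a one-line calculation once population figures are available. The dataset itself has no population column, so the populations below are hypothetical values for illustration.

```python
def cases_per_1000(cases, population):
    """Return the number of cases per 1,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 1000 / population


# Hypothetical counties: (cases, population). Not real data.
small_county = (50, 10_000)
large_county = (500, 1_000_000)

print(cases_per_1000(*small_county))
print(cases_per_1000(*large_county))
```

In this made-up example the smaller county has a tenth of the cases but ten times the per-capita rate, which is the effect the comment above is pointing at.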

data inconsistency

First, thanks so much for this dataset!

One data inconsistency I noticed: at times there's a dip in the number of cases or deaths, which doesn't make sense since the data is cumulative. Am I missing something, or is there another reason for this?

date,county,state,fips,cases,deaths,non_cum_deaths
2020-03-25,Benton,Washington,53005,10,2,-2,0
2020-03-25,Chippewa,Wisconsin,55017,1,0,-1,0
2020-03-25,Geauga,Ohio,39055,4,0,-1,0
2020-03-25,Grant,Washington,53025,26,0,-1,-1
2020-03-24,Hillsborough,New Hampshire,33011,20,0,1,-1
2020-03-25,Kittitas,Washington,53037,6,0,-12,0
2020-03-25,Lexington,South Carolina,45063,18,1,-1,0
2020-03-24,Pasco,Florida,12101,16,0,2,-1
2020-03-25,Rockingham,Virginia,51165,2,0,-1,0
2020-03-24,Rockland,New York,36087,671,3,79,-2
2020-03-25,Snohomish,Washington,53061,633,15,19,-1
2020-03-25,Summit,Colorado,08117,9,0,-1,0
2020-03-25,Sumter,South Carolina,45085,10,0,2,-1
2020-03-25,Unknown,Colorado,08999,18,0,0,-2
2020-03-25,Unknown,Hawaii,15999,5,0,-15,0
2020-03-25,Unknown,Missouri,29999,4,0,-68,0
2020-03-25,Unknown,Puerto Rico,72999,51,2,-1,0
2020-03-25,Unknown,Rhode Island,44999,106,0,-18,0
2020-03-25,Unknown,Vermont,50999,23,0,0,-5
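A listing like the one above can be produced by grouping rows per county, sorting by date, and flagging any day where a cumulative count falls below the previous day's value. This is a sketch of that check, not necessarily how the listing was generated; the county names and counts in the sample are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Invented rows in the us-counties.csv layout (not real data).
SAMPLE = """date,county,state,fips,cases,deaths
2020-03-23,Alpha,Example,99001,10,1
2020-03-24,Alpha,Example,99001,18,1
2020-03-25,Alpha,Example,99001,6,1
2020-03-24,Beta,Example,99002,3,0
2020-03-25,Beta,Example,99002,4,0
"""


def find_decreases(csv_text):
    """Return (date, county, state, case_change, death_change) for every dip."""
    history = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        history[(row["state"], row["county"])].append(
            (row["date"], int(row["cases"]), int(row["deaths"]))
        )
    decreases = []
    for (state, county), records in history.items():
        records.sort()  # ISO dates sort chronologically as plain strings
        for previous, current in zip(records, records[1:]):
            case_change = current[1] - previous[1]
            death_change = current[2] - previous[2]
            if case_change < 0 or death_change < 0:
                decreases.append(
                    (current[0], county, state, case_change, death_change)
                )
    return decreases


print(find_decreases(SAMPLE))
```

Counties are keyed by state and county name rather than FIPS so that rows with a blank FIPS are still grouped.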

Thank you, NYT

I just wanted to express thanks to NYT for releasing this data in the simplest to digest format possible. Fantastic!

And sorry for the noise to everyone who is watching this repository 😄

Also sorry for misappropriating the issue tracker for this. I imagine that it can be less-than-encouraging whenever all the formal feedback received is feature requests. We appreciate the dataset! Thanks!

"True Cases" added to map

Please display "True Cases" on your map! Your map with only confirmed cases is giving citizens like those in Pacifica, CA a false sense of security/calm. Academic research from Wuhan data shows "True Cases" are at least 8x confirmed cases. With failed testing in the US, where probably only 1 in 10 cases is confirmed, "True Cases" could be 80x (8 times 10) confirmed cases or worse. People must know how much danger they are in; many, many are still not concerned.

Academic paper on Wuhan data showing 11 day lag between "True Cases" and confirmed cases:
https://jamanetwork.com/journals/jama/fullarticle/2762130

Key image I annotated:
https://www.dropbox.com/s/r1w632m82ao18xa/WuhanGraph_1_22_circled.png?dl=0

Tomas Pueyo's seminal, but long (26 minutes), Medium post on the subject:
https://medium.com/@tomaspueyo/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca

Endorsements of medium post by experts:
https://medium.com/tomas-pueyo/coronavirus-articles-endorsements-fdc68614f8e3

Example expert who endorsed:
Anthony Costello, ex-Director at the World Health Organization. Professor, University College London (tweet)

People to contact to corroborate model to estimate "True Cases":
Scott Morrow, San Mateo Health (recent NYTimes interview: https://www.nytimes.com/2020/03/12/us/coronavirus-san-mateo-scott-morrow.html)
Anthony Costello, ex-Director at the World Health Organization. Professor, University College London

Nextdoor screenshots:
https://www.dropbox.com/s/pbyjwfmir7lhpoc/localPacficaNextDoor_1percent.png?dl=0
https://www.dropbox.com/s/z0rk643o9asl46t/localPacificaNextDoorPosts.png?dl=0

Nextdoor thread:
https://nextdoor.com/news_feed/?post=142178628