cmu-delphi / covidcast
R and Python packages supporting Delphi's COVIDcast effort.
Home Page: https://delphi.cmu.edu/covidcast/
Currently they give an error because start_day and end_day are not in proper format (missing hyphens).
> covidcast_signal("fb-survey", "raw_cli", start_day = "20200510")
Error in charToDate(x) :
character string is not in a standard unambiguous format
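One possible fix is to accept compact dates and normalize them before querying. A minimal sketch in Python (the `normalize_day` helper is hypothetical, not part of the client):

```python
from datetime import datetime

def normalize_day(day: str) -> str:
    """Accept 'YYYY-MM-DD' or compact 'YYYYMMDD' and return 'YYYY-MM-DD'.

    Hypothetical helper illustrating one way to handle both formats; the
    actual client's argument handling may differ.
    """
    for fmt in ("%Y-%m-%d", "%Y%m%d"):
        try:
            return datetime.strptime(day, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"day {day!r} is not in YYYY-MM-DD or YYYYMMDD format")

print(normalize_day("20200510"))  # → 2020-05-10
```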
Currently, the function check_valid_forecaster_output (in evaluate.R) makes sure that the forecaster has outputted all 23 quantile levels, hard-coded as covidhub_probs:
covidhub_probs <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
However, this is overly restrictive since, as Vishnu pointed out, the COVID Hub submission instructions specify that for the task of "N wk ahead inc case", only 6 quantile levels need to be submitted.
Some options:
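One lightweight option, sketched here in Python for illustration (the real check is R code in evaluate.R), is to require that the submitted levels be a nonempty subset of the 23 allowed levels rather than the full set:

```python
# Mirror of the hard-coded 23 quantile levels from evaluate.R.
covidhub_probs = ([0.01, 0.025]
                  + [round(0.05 * i, 2) for i in range(1, 20)]  # 0.05 .. 0.95
                  + [0.975, 0.99])

def check_quantile_levels(submitted):
    """Relaxed check: accept any nonempty subset of the allowed levels,
    so e.g. a case forecaster submitting only a few levels passes."""
    return len(submitted) > 0 and set(submitted) <= set(covidhub_probs)

print(check_quantile_levels([0.025, 0.25, 0.5, 0.75, 0.975]))  # → True
```

Tightening this per-task (exactly the levels the Hub requires for "N wk ahead inc case") would be a further refinement.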
┆Issue is synchronized with this Asana task by Unito
Split from #13 - Source shapefiles for HRR or use in the plotting functions.
I think they're here https://data.cms.gov/widgets/ia25-mrsk (go to menu>download in top right for shapefiles) or here https://atlasdata.dartmouth.edu/downloads/supplemental#boundaries.
The following line of code returns no data
covidcast::covidcast_signal("jhu-csse", "deaths_cumulative_num", geo_type = "state", geo_values = "mo", start_day = "2020-03-10", end_day = "2020-06-14", as_of="2020-05-06")
And gives warning messages like
50: Fetching deaths_cumulative_num from jhu-csse for 20200428 in geography 'mo': no results
However data for "mo" was certainly available at this time, as can be seen from this file:
https://github.com/CSSEGISandData/COVID-19/blob/476c78eb96eb2d34483daea4c2fc44f3b38bf847/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv#L1604
Can these early data for "mo" be added (maybe other locations are missing too; this was just the one we stumbled across), or a warning returned saying that the data provided may not be complete?
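For reference, the "as of" semantics the query is relying on can be sketched in a few lines of pandas (toy data; column names follow the covidcast data frames shown elsewhere in this thread): among all issues of each (geo_value, time_value) observation, keep the latest issue at or before as_of.

```python
import pandas as pd

# Toy illustration of as_of filtering over issue dates.
rows = pd.DataFrame({
    "geo_value":  ["mo", "mo", "mo"],
    "time_value": ["2020-04-28", "2020-04-28", "2020-04-29"],
    "issue":      ["2020-04-29", "2020-05-20", "2020-04-30"],
    "value":      [38.0, 41.0, 40.0],
})

as_of = "2020-05-06"
snapshot = (rows[rows["issue"] <= as_of]          # drop issues after as_of
            .sort_values("issue")
            .groupby(["geo_value", "time_value"], as_index=False)
            .tail(1))                             # latest remaining issue
print(snapshot[["time_value", "issue", "value"]])
```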
Point people to the mailing list so they can hear about updates (and we can ask how they're using it).
data_source = "indicator-combination"
signal = "confirmed_7dav_incidence_prop"
I use Metro Areas, but looking at your map, it appears to be missing for states and counties as well.
The package only has polygons for states and counties, so it does not support mapping MSAs and HRRs. If we can find and include such polygons, we can easily add plotting support for MSAs and HRRs.
The correlation-utils branch (soon to be merged, I believe) includes correlation functionality right in the R package. This will make computing correlations for our own purposes much easier and cleaner. The covidcast_cor() function is the place to look; the correlation vignette gives lots of examples of what it can do.
We should refactor the signal correlations notebook (and I'm guessing also the correlations shiny app?) so that it uses the package functionality. This will make the code cleaner and allow us to do more powerful correlation analyses; for example, the signal correlations notebook only looks at correlations over one week. With the package functionality, it's very easy to extend this to all time.
One thing to note: the signal correlations notebook (and the correlations shiny app, I think by inheritance) looks at "sweep cuts" of counties by population. This made the correlations look better (bigger counties, better correlations). This isn't something that I implemented in the covidcast_cor() function because I think it was actually just a proxy for screening out counties where there was hardly any COVID activity before computing correlations. Now I suggest just screening out counties with less than (say) 500 total COVID cases before computing correlations; again the vignette gives examples of this.
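The suggested screening step is straightforward; a sketch in pandas (toy data, assumed column names) that keeps only counties whose total case count meets the threshold before any correlations are computed:

```python
import pandas as pd

# Daily case counts for two toy counties; "a" totals 700, "b" totals 150.
cases = pd.DataFrame({
    "geo_value": ["a", "a", "b", "b"],
    "value":     [400.0, 300.0, 100.0, 50.0],
})

totals = cases.groupby("geo_value")["value"].sum()
keep = totals[totals >= 500].index                 # counties passing the cutoff
screened = cases[cases["geo_value"].isin(keep)]    # use this for correlations
print(sorted(keep))  # → ['a']
```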
Tagging krivard capnrefsmmat jsharpna because they might have ideas of who could tackle this. It's relatively low priority but would be a good way for somebody to onboard themselves both with respect to getting to know our R client and our signals. Also tagging nloliveira huisaddison in case they're interested.
I'm currently seeing the following error:
RuntimeError: ('Error when fetching metadata from the API', 'error: Expecting value: line 1 column 1 (char 0)')
When getting metadata:
>>> import covidcast
>>> covidcast.metadata()
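That "Expecting value: line 1 column 1 (char 0)" message is the standard json decode error for a body that isn't JSON at all, which suggests the API returned an empty response or an HTML error page. A small reproduction of the failure mode (the `parse_metadata` wrapper is a stand-in, not the client's actual code):

```python
import json

def parse_metadata(body: str):
    """Mimic the client's error style when the API body isn't valid JSON."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as e:
        raise RuntimeError("Error when fetching metadata from the API",
                           f"error: {e}") from e

try:
    parse_metadata("")   # simulate an empty response body
except RuntimeError as e:
    print(e)
```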
Current implementation:
In [5]: df = covidcast.signal('fb-survey','smoothed_cli', start_day = date(2020,8,3), end_day = date(2020,8,4),
geo_type = 'state', geo_values='CA')
In [6]: df
Out[6]:
geo_value signal time_value direction issue lag value stderr sample_size geo_type data_source
0 ca smoothed_cli 2020-08-03 None 2020-09-03 31 0.659253 0.035870 38134.4402 state fb-survey
0 ca smoothed_cli 2020-08-04 None 2020-09-03 30 0.612967 0.034581 38285.4505 state fb-survey
In [7]: df = covidcast.signal('fb-survey','smoothed_cli', start_day = date(2020,8,3), end_day = date(2020,8,4),
geo_type = 'state', geo_values=['CA', 'CA'])
In [8]: df
Out[8]:
geo_value signal time_value direction issue lag value stderr sample_size geo_type data_source
0 ca smoothed_cli 2020-08-03 None 2020-09-03 31 0.659253 0.035870 38134.4402 state fb-survey
0 ca smoothed_cli 2020-08-04 None 2020-09-03 30 0.612967 0.034581 38285.4505 state fb-survey
0 ca smoothed_cli 2020-08-03 None 2020-09-03 31 0.659253 0.035870 38134.4402 state fb-survey
0 ca smoothed_cli 2020-08-04 None 2020-09-03 30 0.612967 0.034581 38285.4505 state fb-survey
Feels like they should return the same thing.
Should just be a one-liner, set(geo_values), if this is what we want to do.
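One wrinkle with a plain set() is that it discards the user's ordering of geo_values; dict.fromkeys dedupes while preserving order, which keeps the output row order predictable:

```python
def dedup_geo_values(geo_values):
    """Drop duplicate geo_values while preserving first-seen order.
    (Hypothetical helper name; the actual fix would live in signal().)"""
    return list(dict.fromkeys(geo_values))

print(dedup_geo_values(["CA", "CA", "NY"]))  # → ['CA', 'NY']
```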
In writing the Facebook blog post, I wanted to know a bunch of basic things about the Facebook survey, and found myself writing code to find the answers. Then I thought we should just build a very basic dashboard so that we can constantly check these things. Dashboard is probably too generous, I'm just talking about a notebook that gets run automatically each day. I put one up here.
This was pretty easy to build with the functionality in our covidcast R package. If people agree it's useful, then we should make one of these for each of our signals, and write a script that runs these each morning.
Note: I guess we should discuss what this is for and who would use it.
Bottom line, I would say, is that: nothing fancy here. I think we want to keep these notebooks simple to maintain and simple to look at.
Tagging krivard capnrefsmmat jsharpna because they might have ideas of who could tackle this. It's relatively low priority but would be a good way for somebody to onboard themselves both with respect to getting to know our R client and our signals.
Right now the requirements.txt file is just for the CI tool to install testing dependencies. It may also be useful to generate a requirements file for devs, so they can install things like wheel, sphinx (and extensions), etc.
Python package version of #48
For example, suppose I want to do a regression where I predict cases using various other signals. I need to obtain the signals, join them together, maybe lag some variables, and produce a data frame with one row per location and time, containing all measurements at that location/time.
We should provide a function, name to be determined, that:
- takes a bunch of signal data frames
- joins them together and names the value columns after the signals
- optionally lags some variables, e.g. if I want to predict cases with lagged signals
- returns one big data frame
Then we should put a few examples in the package docs showing how we can use this to conduct a data analysis.
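The steps above can be sketched in pandas (the `combine_signals` name and signature are placeholders for the function to be designed):

```python
import pandas as pd

def combine_signals(frames, lags=None):
    """Merge signal frames on (geo_value, time_value), name value columns
    after the signals, and optionally add lagged copies of some columns."""
    wide = None
    for df in frames:
        name = df["signal"].iloc[0]
        part = df[["geo_value", "time_value", "value"]].rename(columns={"value": name})
        wide = part if wide is None else wide.merge(
            part, on=["geo_value", "time_value"], how="outer")
    for name, k in (lags or {}).items():
        # shift within each location; assignment aligns on the original index
        wide[f"{name}_lag{k}"] = (wide.sort_values("time_value")
                                      .groupby("geo_value")[name].shift(k))
    return wide

a = pd.DataFrame({"geo_value": ["ca", "ca"], "time_value": [1, 2],
                  "signal": "cases", "value": [10.0, 12.0]})
b = pd.DataFrame({"geo_value": ["ca", "ca"], "time_value": [1, 2],
                  "signal": "cli", "value": [0.5, 0.6]})
wide = combine_signals([a, b], lags={"cli": 1})
print(wide)
```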
Currently only choropleths are supported with plotting. Add support for bubble maps like what is available in the R package
Thread in #68. Basically the current shapefiles don't line up perfectly with the background light grey state map, so they need to be translated slightly (which was done by hand). There's an additional source for the files (Dartmouth, with the following terms), but AK and HI don't line up correctly for those. We need to figure out some more rigorous way to get consistency between the HRR and the existing census shapefiles.
Current plotting behaviour is to have all counties without a signal value inherit the megacounty value, and they all get plotted individually. A "proper" implementation would be to have the megacounty plotted as one polygon like it is on the website.
Could do this either by joining polygons or by plotting all states with the megacounty value and then layering on the counties.
Currently geo/plotting functions only support state and county. Add support for MSAs as well. Blocked by #40
Currently they are tibbles, i.e. of class
"tbl_df" "tbl" "data.frame".
Instead, we'd have them inherit from the above but be their own class. For example, a predictions card would have class
"predictions_card" "tbl_df" "tbl" "data.frame"
and then:
- print.predictions_card would print out some of the attribute information (for example, name of the forecaster, forecast_date, ahead, etc.)
- plot.predictions_card would make the trajectory plots
Same idea for scorecards.
When I tried to re-run this, it threw an error, because the API itself is returning no results.
I believe this is the API call it's trying to make:
API call results as of 2020-05-07 16:42 (Eastern):
{"result":-2,"message":"no results"}
Here's a permalink to the line of code in question:
It's currently in the R package, but not exported. Remaining issues:
- Does it work with, e.g., data_source = "jhu-csse" and signal = c("confirmed_incidence_num", "confirmed_incidence_prop")?
- Should it support multiple geo_types in one call?
- Should plot.covidcast_signal() handle data frames containing multiple signals? It could make a grid of plots, perhaps; faceting doesn't work easily because different signals are on different scales.
I also think we should provide a convenience function that wraps pivot_wider to convert a data frame with multiple signals to have one row per date and one column per signal, named after the signal, instead of one row per observation per signal. People might like such a data frame (it'd be useful for making feature matrices for models).
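For the Python side, the same reshaping is a one-call pivot in pandas (toy data, column names as in the covidcast frames shown above):

```python
import pandas as pd

# Long format: one row per observation per signal.
long = pd.DataFrame({
    "geo_value":  ["ca", "ca", "ca", "ca"],
    "time_value": ["2020-08-03", "2020-08-03", "2020-08-04", "2020-08-04"],
    "signal":     ["confirmed_incidence_num", "confirmed_incidence_prop"] * 2,
    "value":      [1000.0, 2.5, 1100.0, 2.8],
})

# Wide format: one row per (geo_value, time_value), one column per signal.
wide = (long.pivot(index=["geo_value", "time_value"],
                   columns="signal", values="value")
            .reset_index())
print(wide)
```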
Split from #13 - Source shapefiles for MSA or use in the plotting functions.
MSAs are part of core-based statistical areas, which can be found on the census boundary file site. Combined statistical areas are not MSAs, as they are "various combinations of adjacent metropolitan and micropolitan areas with economic ties measured by commuting patterns".
The upstream data generation pipeline relies on accurate warnings to detect anomalies. Currently, if the process generates warnings, I have to explicitly request to see them at the end of the run. It would really help to send the warnings to a logger as they happen.
The code below, which can be placed in covidcast/R/zzz.R as is, allows a user to set an option covidcast.warning. If not set, it falls back to the standard base::warning. In addition, each call to warning("bad", "stuff", ...) has to be replaced with getOption("covidcast.warning")("bad", "stuff", ...), which is a simple search and replace. Example below. Happy to provide a PR if needed.
zzz.R
## Set up flexible logging options on load
.onLoad <- function(libname, pkgname) {
opts <- options()
opts.covidcast <- list(
## Default warning function is the base warning function
covidcast.warning = base::warning
)
toset <- !(names(opts.covidcast) %in% names(opts))
if (any(toset)) {
options(opts.covidcast[toset])
}
invisible()
}
> library(covidcast)
We encourage COVIDcast API users to register on our mailing list:
https://lists.andrew.cmu.edu/mailman/listinfo/delphi-covidcast-api
We'll send announcements about new data sources, package updates,
server maintenance, and new features.
> m <- covidcast_signal("fb-survey", "raw_cli", geo_type = "msa",
geo_values = name_to_cbsa("Pittsburgh"))
Fetched day 20200406: 1, success, num_entries = 1
...
Fetched day 20200901: 1, success, num_entries = 1
Warning message:
In single_geo(data_source, signal, start_day, end_day, geo_type, :
Fetching raw_cli from fb-survey for 20200412 in geography '38300': no results
Now set the option:
options(covidcast.warning=logger::log_warn)
> m <- covidcast_signal("fb-survey", "raw_cli", geo_type = "msa",
geo_values = name_to_cbsa("Pittsburgh"))
Fetched day 20200406: 1, success, num_entries = 1
Fetched day 20200407: 1, success, num_entries = 1
Fetched day 20200408: 1, success, num_entries = 1
Fetched day 20200409: 1, success, num_entries = 1
Fetched day 20200410: 1, success, num_entries = 1
Fetched day 20200411: 1, success, num_entries = 1
WARN [2020-09-03 15:31:06] Fetching raw_cli from fb-survey for 20200412 in geography '38300': no results
Fetched day 20200413: 1, success, num_entries = 1
...
Currently mypy doesn't raise errors if certain objects, such as matplotlib Figures and GeoPandas GeoDataFrames, are typed incorrectly. The working theory is that it doesn't recognize them and considers them Any, so everything passes. Either find mypy stubs for these objects, make stubs/custom types, or come up with some other config that catches this.
Also explore pytype.
The Get started with covidcast guide includes a link to the source at vignettes/covidcast.Rmd. This is likely intended to lead to:
https://github.com/cmu-delphi/covidcast/blob/main/R-packages/covidcast/vignettes/covidcast.Rmd
but actually leads to the non-existent:
https://github.com/cmu-delphi/covidcast/blob/main/vignettes/covidcast.Rmd
and results in a 404.
While the vignette is viewable online, I followed the instructions at the covidcast R package documentation and found that I could not reference the vignettes from RStudio i.e. vignette("covidcast") does not work.
This appears to affect other vignette links as well.
Based on the format of other packages, it looks like a covidcast/Meta/vignette.rds file is missing, as well as covidcast/doc, so there may be an issue with the package configuration.
Copied from cmu-delphi/covidcast-indicators#197
Blocked by cmu-delphi/delphi-epidata#184
Add some additional info to the DEVELOP.md file re: developing, testing, documentation, etc. Will be useful with more people coming onboard
#29 introduces a covidcast_cor function for calculating correlations between time series. Sometimes, though, some of the geographies being correlated have very little data, and so you get this warning:
## Warning in cor(x = value.x, y = value.y, use = use, method = method): the
## standard deviation is zero
That happens if a geography only has a couple days of data, so the relationship between two signals is a perfect line.
You can see this happen in vignettes/correlation-utils.Rmd in the final code block. We simply suppressed warnings for that block; we should figure out a more elegant approach, such as having covidcast_cor catch the repeated warnings and issue only one warning that says this problem has occurred.
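The catch-and-summarize idea can be sketched generically (illustrative Python, not covidcast_cor's actual code; `correlate_with_summary` and `toy_cor` are made-up names): record each geography's warnings, then re-issue a single summary warning.

```python
import warnings

def correlate_with_summary(geos, compute):
    """Run compute(geo) per geography, swallowing per-geography warnings
    and issuing one summary warning at the end instead."""
    flagged = 0
    results = {}
    for g in geos:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            results[g] = compute(g)
        flagged += bool(caught)
    if flagged:
        warnings.warn(f"standard deviation was zero in {flagged} geographies; "
                      "their correlations are NA")
    return results

def toy_cor(g):
    # stand-in per-geography computation: warns when the data are degenerate
    if g != "a":
        warnings.warn("the standard deviation is zero")
        return float("nan")
    return 0.9
```

Calling correlate_with_summary(["a", "b", "c"], toy_cor) then emits one warning mentioning 2 geographies rather than two identical warnings.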
The R package has some convenience data included to look up counties, MSAs, etc., and includes polygons to map some of them. Even if the Python package doesn't include mapping features, we should find a good source of similar data.
If there's an existing Python package that includes the data and polygons we want, we could just include a couple examples showing how to use it. If there isn't, we can adapt R-packages/data-raw/make.R
to produce the data files in a format the Python package can use easily.
Have looked into this a little bit w/ the GeoPandas docs. Can't save to shapefile since it's a datetime, so would have to cast that col to a string. When I save to geojson, 5 out of 3233 geometries don't line up for some reason. When I plot them, they look malformed. Need to look more into this.
Now that we have the Census data included in the R package, it'd be convenient to add helper functions to look up geo IDs by name:
x %>%
filter(geo_value == fips_lookup("Allegheny")) %>%
plot(plot_type = "line")
(Example from Ryan)
We could do this for counties and MSAs pretty easily, and they'd be convenient for use with the geo_values argument. They should be vectorized, so one could do msa_lookup(c("Pittsburgh", "San Antonio", "Milwaukee")) and get something that can be passed to geo_values.
As of #29, the R package now has functions fips_to_name, name_to_fips, etc. for working with FIPS codes and CBSA IDs. See the documentation.
This would be very handy for the Python package, because it's pretty inconvenient to have to figure out the right code for a county or MSA to be able to query data for it. I suspect the Python versions can work very similarly to the R versions, using the Census shapefiles (or other Census data, such as the Census data frames produced for the R package) to get names and IDs.
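A minimal sketch of what the Python analogues could look like, backed by a tiny inline table (the real versions would ship the full Census-derived data; the function bodies and table here are purely illustrative):

```python
# Tiny stand-in for the Census-derived county table.
_COUNTIES = {"42003": "Allegheny County, PA",
             "06037": "Los Angeles County, CA"}

def fips_to_name(codes):
    """Vectorized FIPS -> county name lookup (None for unknown codes)."""
    return [_COUNTIES.get(c) for c in codes]

def name_to_fips(patterns):
    """Vectorized substring match on county names -> lists of FIPS codes,
    mirroring the R helpers' one-to-many behaviour."""
    return [[fips for fips, name in _COUNTIES.items()
             if p.lower() in name.lower()]
            for p in patterns]

print(name_to_fips(["allegheny"]))  # → [['42003']]
```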
The functions in evalcast's get_covidhub_predictions.R currently use rvest and xml2 to scrape data off of GitHub; they should instead use a proper API client such as gh. This means rewriting evalcast::get_forecast_dates and evalcast::get_covid_hub_forecaster_names.
EDIT: Apparently the zoltr R package is intended for this, so a starting point would be to look at that package.
When evalcast::get_covidhub_predictions retrieves a csv file from a forecaster for a given forecast date, the "as of" principle should apply, i.e., we should get the csv file that had been posted as of the forecast date. Nick Reich noted that this is not always the case, as sometimes forecasters replace their file later. I'm not sure how much this occurs, but this is certainly an important point for ensuring fairness.
In ggplot2's geom_point, size means radius rather than area, so plot_bubble generates bubbles with radius proportional to value instead of area proportional to value. This exaggerates the size of the largest bubbles.
Currently plot_bubble manually constructs a size scale. We should see if we can use scale_size_binned_area() or similar to achieve the same thing but with scaling by area.
Given a signal over multiple dates, return an animation (maybe using matplotlib animation)
Currently geo/plotting functions only support state and county. Add support for MSAs as well. Blocked by #38
Currently Puerto Rico is not plotted even if a signal exists for that region.
Consider the following:
import covidcast
from datetime import date
import matplotlib.pyplot as plt
data = covidcast.signal("fb-survey", "smoothed_hh_cmnty_cli",
date(2020, 9, 8), date(2020, 9, 8),
geo_type="state")
covidcast.plot_choropleth(data, figsize=(6, 4))
plt.title("% who know someone who is sick, Sept 8, 2020")
I get this:
The colorbar is simply much too small for this size. I have to increase the figsize to at least (12, 10) to fit the colorbar text without collisions. Is it possible to adjust the colorbar drawing so it uses the available space more effectively? We could have it be full-width, but then large maps would look goofy.
For both the R and Python package documentation, we need to be a bit more welcoming. A new user will want to know
The R package provides mapping; the Python package does not. Admittedly it's not common for Python packages to provide their own plotting the way it is for R packages.
We should look at Cartopy and other relevant Python packages and see what it'd take to make mapping easy in the Python package. That need not mean implementing maps ourselves -- we could, for example, discover that by rearranging our data in a certain way, it's easy to provide to Cartopy. Then we can just provide the right adapter and give some examples of usage in the documentation.
This probably depends on #13.
Come up with some way to test the plotting functions in test_plotting.py. Testing plots sucks since everything renders differently. Long thread on it in #27.
Realized this wasn't being mocked out, and so the unit tests are currently making network calls, which is undesirable.
A primary use case would be when a forecaster uses a static data frame (e.g., census data). get_predictions would be passed these additional arguments using .... Perhaps baseline_forecaster should be modified to demonstrate this functionality? Or at least its documentation could be.
plot.covidcast_signal has various extra options that are currently documented with "TODO explain all these advanced parameters". We should document them.
Right now the mypy configuration is fairly bare-bones. We should add some flags to enforce behaviour regarding untyped defs, and also to keep None defaults from implicitly becoming Optional.
Example of config file https://github.com/markkohdev/mypy-example/blob/master/mypy.ini
For example, suppose I want to do a regression where I predict cases using various other signals. I need to obtain the signals, join them together, maybe lag some variables, and produce a data frame with one row per location and time, containing all measurements at that location/time.
We should provide a function, name to be determined, that joins the signal data frames together and names the value columns after the signals. Then we should put a few examples in the package docs showing how we can use this to conduct a data analysis.
@chinandrew, any changes we need to make before we should release 0.0.10? Should we celebrate by calling it 0.1.0?
Given a list of predictions cards corresponding to the same forecasting task except with varying ahead and forecaster, makes a plot where one sees the actual response over a specified period and the forecast trajectories, faceted by location. Each forecaster's trajectories are a different color (showing the median). The 100(1-alpha)% prediction interval is shown with a gray band (geom_ribbon).
plot_trajectory(list_of_predictions_cards, first_day = "2020-07-01", last_day = "2020-10-01", alpha = 0.2)
In terms of the "actual response" referred to above, one would query from covidcast (should use evalcast's function download_signal). The calculation of "actual" depends on the incidence_period: when this is "epiweek", the sum of the response over the last 7 days would be good to show. However, when this is "day", just the raw response should be shown (this is to accommodate cumulative forecasting).
ajgreen93 and jsharpna both have code for making trajectory plots for the epiweek case... this would be a good starting point.
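For the epiweek case, the "actual" is just a trailing 7-day sum of the daily signal per location; a sketch in pandas (toy data, column names as in the covidcast frames):

```python
import pandas as pd

# Eight days of a constant daily signal for one location.
daily = pd.DataFrame({
    "geo_value":  ["ca"] * 8,
    "time_value": pd.date_range("2020-07-01", periods=8),
    "value":      [1.0] * 8,
})

# Trailing 7-day sum within each location; NaN until a full window exists.
daily["actual_epiweek"] = (daily.groupby("geo_value")["value"]
                                .transform(lambda s: s.rolling(7).sum()))
print(daily["actual_epiweek"].tolist())  # last two entries are 7.0
```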
The MSA is Michigan City - LaPorte, cbsa/geo_value = 33140, source = "indicator-combination", signal = "confirmed_7dav_incidence_prop"
Aug 1st value = 812.385598. It's been around 8 or 9 recently, so I'm guessing this should be 8.12385598, but would like to confirm.