cmu-delphi / covidcast
R and Python packages supporting Delphi's COVIDcast effort.
Home Page: https://delphi.cmu.edu/covidcast/
Currently they give an error because start_day and end_day are not in proper format (missing hyphens).
> covidcast_signal("fb-survey", "raw_cli", start_day = "20200510")
Error in charToDate(x) :
character string is not in a standard unambiguous format
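One possible fix is to accept compact dates and normalize them before querying. A minimal sketch in Python (the `normalize_day` helper is hypothetical, not part of the client):

```python
from datetime import datetime

def normalize_day(day: str) -> str:
    """Accept 'YYYY-MM-DD' or compact 'YYYYMMDD' and return 'YYYY-MM-DD'.

    Hypothetical helper illustrating one way to handle both formats; the
    actual client's argument handling may differ.
    """
    for fmt in ("%Y-%m-%d", "%Y%m%d"):
        try:
            return datetime.strptime(day, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"day {day!r} is not in YYYY-MM-DD or YYYYMMDD format")

print(normalize_day("20200510"))  # → 2020-05-10
```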
Currently, the function check_valid_forecaster_output (in evaluate.R) makes sure that the forecaster has outputted all 23 quantile levels, hard-coded as covidhub_probs:
covidhub_probs <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
However, this is overly restrictive since, as Vishnu pointed out, the COVID Hub submission instructions specify that for the task of "N wk ahead inc case", only 6 quantile levels need to be submitted.
Some options:
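One lightweight option, sketched here in Python for illustration (the real check is R code in evaluate.R), is to require that the submitted levels be a nonempty subset of the 23 allowed levels rather than the full set:

```python
# Mirror of the hard-coded 23 quantile levels from evaluate.R.
covidhub_probs = ([0.01, 0.025]
                  + [round(0.05 * i, 2) for i in range(1, 20)]  # 0.05 .. 0.95
                  + [0.975, 0.99])

def check_quantile_levels(submitted):
    """Relaxed check: accept any nonempty subset of the allowed levels,
    so e.g. a case forecaster submitting only a few levels passes."""
    return len(submitted) > 0 and set(submitted) <= set(covidhub_probs)

print(check_quantile_levels([0.025, 0.25, 0.5, 0.75, 0.975]))  # → True
```

Tightening this per-task (exactly the levels the Hub requires for "N wk ahead inc case") would be a further refinement.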
┆Issue is synchronized with this Asana task by Unito
Split from #13 - Source shapefiles for HRR or use in the plotting functions.
I think they're here https://data.cms.gov/widgets/ia25-mrsk (go to menu>download in top right for shapefiles) or here https://atlasdata.dartmouth.edu/downloads/supplemental#boundaries.
The following line of code returns no data
covidcast::covidcast_signal("jhu-csse", "deaths_cumulative_num", geo_type = "state", geo_values = "mo", start_day = "2020-03-10", end_day = "2020-06-14", as_of="2020-05-06")
And gives warning messages like
50: Fetching deaths_cumulative_num from jhu-csse for 20200428 in geography 'mo': no results
However data for "mo" was certainly available at this time, as can be seen from this file:
https://github.com/CSSEGISandData/COVID-19/blob/476c78eb96eb2d34483daea4c2fc44f3b38bf847/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv#L1604
Can these early data for "mo" be added (maybe other locations are missing too; this was just the one we stumbled across), or a warning returned saying that the data provided may not be complete?
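For reference, the "as of" semantics the query is relying on can be sketched in a few lines of pandas (toy data; column names follow the covidcast data frames shown elsewhere in this thread): among all issues of each (geo_value, time_value) observation, keep the latest issue at or before as_of.

```python
import pandas as pd

# Toy illustration of as_of filtering over issue dates.
rows = pd.DataFrame({
    "geo_value":  ["mo", "mo", "mo"],
    "time_value": ["2020-04-28", "2020-04-28", "2020-04-29"],
    "issue":      ["2020-04-29", "2020-05-20", "2020-04-30"],
    "value":      [38.0, 41.0, 40.0],
})

as_of = "2020-05-06"
snapshot = (rows[rows["issue"] <= as_of]          # drop issues after as_of
            .sort_values("issue")
            .groupby(["geo_value", "time_value"], as_index=False)
            .tail(1))                             # latest remaining issue
print(snapshot[["time_value", "issue", "value"]])
```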
Point people to the mailing list so they can hear about updates (and we can ask how they're using it).
data_source = "indicator-combination"
signal = "confirmed_7dav_incidence_prop"
I use Metro Areas, but looking at your map, it appears to be missing for states and counties as well.
The package only has polygons for states and counties, so it does not support mapping MSAs and HRRs. If we can find and include such polygons, we can easily add plotting support for MSAs and HRRs.
The correlation-utils branch (soon to be merged, I believe) includes correlation functionality right in the R package. This will make computing correlations for our own purposes much easier and cleaner. The covidcast_cor() function is the place to look; the correlation vignette gives lots of examples of what it can do.
We should refactor the signal correlations notebook (and I'm guessing also the correlations shiny app?) so that it uses the package functionality. This will make the code cleaner and allow us to do more powerful correlation analyses; for example, the signal correlations notebook only looks at correlations over one week. With the package functionality, it's very easy to extend this to all time.
One thing to note: the signal correlations notebook (and the correlations shiny app, I think by inheritance) looks at "sweep cuts" of counties by population. This made the correlations look better (bigger counties, better correlations). This isn't something that I implemented in the covidcast_cor() function because I think it was actually just a proxy for screening out counties where there was hardly any COVID activity before computing correlations. Now I suggest just screening out counties with less than (say) 500 total COVID cases before computing correlations; again the vignette gives examples of this.
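The suggested screening step is straightforward; a sketch in pandas (toy data, assumed column names) that keeps only counties whose total case count meets the threshold before any correlations are computed:

```python
import pandas as pd

# Daily case counts for two toy counties; "a" totals 700, "b" totals 150.
cases = pd.DataFrame({
    "geo_value": ["a", "a", "b", "b"],
    "value":     [400.0, 300.0, 100.0, 50.0],
})

totals = cases.groupby("geo_value")["value"].sum()
keep = totals[totals >= 500].index                 # counties passing the cutoff
screened = cases[cases["geo_value"].isin(keep)]    # use this for correlations
print(sorted(keep))  # → ['a']
```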
Tagging krivard capnrefsmmat jsharpna because they might have ideas of who could tackle this. It's relatively low priority but would be a good way for somebody to onboard themselves both with respect to getting to know our R client and our signals. Also tagging nloliveira huisaddison in case they're interested.
I'm currently seeing the following error:
RuntimeError: ('Error when fetching metadata from the API', 'error: Expecting value: line 1 column 1 (char 0)')
When getting metadata:
>>> import covidcast
>>> covidcast.metadata()
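That "Expecting value: line 1 column 1 (char 0)" message is the standard json decode error for a body that isn't JSON at all, which suggests the API returned an empty response or an HTML error page. A small reproduction of the failure mode (the `parse_metadata` wrapper is a stand-in, not the client's actual code):

```python
import json

def parse_metadata(body: str):
    """Mimic the client's error style when the API body isn't valid JSON."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as e:
        raise RuntimeError("Error when fetching metadata from the API",
                           f"error: {e}") from e

try:
    parse_metadata("")   # simulate an empty response body
except RuntimeError as e:
    print(e)
```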
Current implementation:
In [5]: df = covidcast.signal('fb-survey','smoothed_cli', start_day = date(2020,8,3), end_day = date(2020,8,4),
geo_type = 'state', geo_values='CA')
In [6]: df
Out[6]:
geo_value signal time_value direction issue lag value stderr sample_size geo_type data_source
0 ca smoothed_cli 2020-08-03 None 2020-09-03 31 0.659253 0.035870 38134.4402 state fb-survey
0 ca smoothed_cli 2020-08-04 None 2020-09-03 30 0.612967 0.034581 38285.4505 state fb-survey
In [7]: df = covidcast.signal('fb-survey','smoothed_cli', start_day = date(2020,8,3), end_day = date(2020,8,4),
geo_type = 'state', geo_values=['CA', 'CA'])
In [8]: df
Out[8]:
geo_value signal time_value direction issue lag value stderr sample_size geo_type data_source
0 ca smoothed_cli 2020-08-03 None 2020-09-03 31 0.659253 0.035870 38134.4402 state fb-survey
0 ca smoothed_cli 2020-08-04 None 2020-09-03 30 0.612967 0.034581 38285.4505 state fb-survey
0 ca smoothed_cli 2020-08-03 None 2020-09-03 31 0.659253 0.035870 38134.4402 state fb-survey
0 ca smoothed_cli 2020-08-04 None 2020-09-03 30 0.612967 0.034581 38285.4505 state fb-survey
Feels like they should return the same thing.
Should just be a one-liner, set(geo_values), if this is what we want to do.
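One wrinkle with a plain set() is that it discards the user's ordering of geo_values; dict.fromkeys dedupes while preserving order, which keeps the output row order predictable:

```python
def dedup_geo_values(geo_values):
    """Drop duplicate geo_values while preserving first-seen order.
    (Hypothetical helper name; the actual fix would live in signal().)"""
    return list(dict.fromkeys(geo_values))

print(dedup_geo_values(["CA", "CA", "NY"]))  # → ['CA', 'NY']
```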
In writing the Facebook blog post, I wanted to know a bunch of basic things about the Facebook survey, and found myself writing code to find the answers. Then I thought we should just build a very basic dashboard so that we can constantly check these things. Dashboard is probably too generous, I'm just talking about a notebook that gets run automatically each day. I put one up here.
This was pretty easy to build with the functionality in our covidcast R package. If people agree it's useful, then we should make one of these for each of our signals, and write a script that runs these each morning.
Note: I guess we should discuss what this is for and who would use it.
Bottom line, I would say, is that: nothing fancy here. I think we want to keep these notebooks simple to maintain and simple to look at.
Tagging krivard capnrefsmmat jsharpna because they might have ideas of who could tackle this. It's relatively low priority but would be a good way for somebody to onboard themselves both with respect to getting to know our R client and our signals.
Right now the requirements.txt file is just for the CI tool to install testing dependencies. It may also be useful to generate a requirements file for devs, so they can install things like wheel, sphinx (and extensions), etc.
Python package version of #48
For example, suppose I want to do a regression where I predict cases using various other signals. I need to obtain the signals, join them together, maybe lag some variables, and produce a data frame with one row per location and time, containing all measurements at that location/time.
We should provide a function, name to be determined, that:
- takes a bunch of signal data frames
- joins them together and names the value columns after the signals
- optionally lags some variables, e.g. if I want to predict cases with lagged signals
- returns one big data frame
Then we should put a few examples in the package docs showing how we can use this to conduct a data analysis.
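The steps above can be sketched in pandas (the `combine_signals` name and signature are placeholders for the function to be designed):

```python
import pandas as pd

def combine_signals(frames, lags=None):
    """Merge signal frames on (geo_value, time_value), name value columns
    after the signals, and optionally add lagged copies of some columns."""
    wide = None
    for df in frames:
        name = df["signal"].iloc[0]
        part = df[["geo_value", "time_value", "value"]].rename(columns={"value": name})
        wide = part if wide is None else wide.merge(
            part, on=["geo_value", "time_value"], how="outer")
    for name, k in (lags or {}).items():
        # shift within each location; assignment aligns on the original index
        wide[f"{name}_lag{k}"] = (wide.sort_values("time_value")
                                      .groupby("geo_value")[name].shift(k))
    return wide

a = pd.DataFrame({"geo_value": ["ca", "ca"], "time_value": [1, 2],
                  "signal": "cases", "value": [10.0, 12.0]})
b = pd.DataFrame({"geo_value": ["ca", "ca"], "time_value": [1, 2],
                  "signal": "cli", "value": [0.5, 0.6]})
wide = combine_signals([a, b], lags={"cli": 1})
print(wide)
```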
Currently only choropleths are supported with plotting. Add support for bubble maps like what is available in the R package
Thread in #68. Basically the current shapefiles don't line up perfectly with the background light grey state map, so they need to be translated slightly (which was done by hand). There's an additional source for the files (Dartmouth, with the following terms), but AK and HI don't line up correctly for those. We need to figure out some more rigorous way to get consistency between the HRR and the existing census shapefiles.
Current plotting behaviour is to have all counties without a signal value inherit the megacounty value, and they all get plotted individually. A "proper" implementation would be to have the megacounty plotted as one polygon like it is on the website.
Could do this either by joining polygons or by plotting all states with the megacounty value and then layering on the counties.
Currently geo/plotting functions only support state and county. Add support for MSAs as well. Blocked by #40
Currently they are tibbles, i.e. of class
"tbl_df" "tbl" "data.frame".
Instead, we'd have them inherit from the above but be their own class. For example, a predictions card would have class
"predictions_card" "tbl_df" "tbl" "data.frame"
and then:
- print.predictions_card would print out some of the attribute information (for example, name of the forecaster, forecast_date, ahead, etc.)
- plot.predictions_card would make the trajectory plots
Same idea for scorecards.
When I tried to re-run this, it threw an error, because the API itself is returning no results.
I believe this is the API call it's trying to make:
API call results as of 2020-05-07 16:42 (Eastern):
{"result":-2,"message":"no results"}
Here's a permalink to the line of code in question:
It's currently in the R package, but not exported. Remaining issues:
- Does it work with, e.g., data_source = "jhu-csse" and signal = c("confirmed_incidence_num", "confirmed_incidence_prop")?
- Should it support multiple geo_types in one call?
- Should plot.covidcast_signal() handle data frames containing multiple signals? It could make a grid of plots, perhaps; faceting doesn't work easily because different signals are on different scales.
I also think we should provide a convenience function that wraps pivot_wider to convert a data frame with multiple signals to have one row per date and one column per signal, named after the signal, instead of one row per observation per signal. People might like such a data frame (it'd be useful for making feature matrices for models).
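For the Python side, the same reshaping is a one-call pivot in pandas (toy data, column names as in the covidcast frames shown above):

```python
import pandas as pd

# Long format: one row per observation per signal.
long = pd.DataFrame({
    "geo_value":  ["ca", "ca", "ca", "ca"],
    "time_value": ["2020-08-03", "2020-08-03", "2020-08-04", "2020-08-04"],
    "signal":     ["confirmed_incidence_num", "confirmed_incidence_prop"] * 2,
    "value":      [1000.0, 2.5, 1100.0, 2.8],
})

# Wide format: one row per (geo_value, time_value), one column per signal.
wide = (long.pivot(index=["geo_value", "time_value"],
                   columns="signal", values="value")
            .reset_index())
print(wide)
```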
Split from #13 - Source shapefiles for MSA or use in the plotting functions.
MSAs are part of core-based statistical areas, which can be found on the census boundary file site. Combined statistical areas are not MSAs, as they are "various combinations of adjacent metropolitan and micropolitan areas with economic ties measured by commuting patterns".
The upstream data generation pipeline relies on accurate warnings to detect anomalies. Currently, if the process generates warnings, I have to explicitly request to see them at the end of the run. It would really help to send the warnings to a logger as they happen.
The code below, which can be placed in covidcast/R/zzz.R as is, allows a user to set an option covidcast.warning. If not set, it falls back to the standard base::warning. In addition, each call to warning("bad", "stuff", ...) has to be replaced with getOption("covidcast.warning")("bad", "stuff", ...), which is a simple search and replace. Example below. Happy to provide a PR if needed.
zzz.R
## Set up flexible logging options on load
.onLoad <- function(libname, pkgname) {
opts <- options()
opts.covidcast <- list(
## Default warning function is the base warning function
covidcast.warning = base::warning
)
toset <- !(names(opts.covidcast) %in% names(opts))
if (any(toset)) {
options(opts.covidcast[toset])
}
invisible()
}
> library(covidcast)
We encourage COVIDcast API users to register on our mailing list:
https://lists.andrew.cmu.edu/mailman/listinfo/delphi-covidcast-api
We'll send announcements about new data sources, package updates,
server maintenance, and new features.
> m <- covidcast_signal("fb-survey", "raw_cli", geo_type = "msa",
geo_values = name_to_cbsa("Pittsburgh"))
Fetched day 20200406: 1, success, num_entries = 1
...
Fetched day 20200901: 1, success, num_entries = 1
Warning message:
In single_geo(data_source, signal, start_day, end_day, geo_type, :
Fetching raw_cli from fb-survey for 20200412 in geography '38300': no results
Now set the option:
options(covidcast.warning=logger::log_warn)
> m <- covidcast_signal("fb-survey", "raw_cli", geo_type = "msa",
geo_values = name_to_cbsa("Pittsburgh"))
Fetched day 20200406: 1, success, num_entries = 1
Fetched day 20200407: 1, success, num_entries = 1
Fetched day 20200408: 1, success, num_entries = 1
Fetched day 20200409: 1, success, num_entries = 1
Fetched day 20200410: 1, success, num_entries = 1
Fetched day 20200411: 1, success, num_entries = 1
WARN [2020-09-03 15:31:06] Fetching raw_cli from fb-survey for 20200412 in geography '38300': no results
Fetched day 20200413: 1, success, num_entries = 1
...
Currently mypy doesn't raise errors if certain objects, such as matplotlib Figures and GeoPandas GeoDataFrames, are typed incorrectly. The working theory is that it doesn't recognize them and considers them Any, so everything passes. Either find mypy stubs for these objects, make stubs/custom types, or come up with some other config that catches this.
Also explore pytype.
The Get started with covidcast guide includes a link to the source at vignettes/covidcast.Rmd. This is likely intended to lead to:
https://github.com/cmu-delphi/covidcast/blob/main/R-packages/covidcast/vignettes/covidcast.Rmd
but actually leads to the non-existent:
https://github.com/cmu-delphi/covidcast/blob/main/vignettes/covidcast.Rmd
and results in a 404.
While the vignette is viewable online, I followed the instructions at the covidcast R package documentation and found that I could not reference the vignettes from RStudio i.e. vignette("covidcast") does not work.
This appears to affect other vignette links as well.
Based on the format of other packages, it looks like a covidcast/Meta/vignette.rds file is missing, as well as covidcast/doc, so there may be an issue with the package configuration.
Copied from cmu-delphi/covidcast-indicators#197
Blocked by cmu-delphi/delphi-epidata#184
Add some additional info to the DEVELOP.md file re: developing, testing, documentation, etc. Will be useful with more people coming onboard
#29 introduces a covidcast_cor function for calculating correlations between time series. Sometimes, though, some of the geographies being correlated have very little data, and so you get this warning:
## Warning in cor(x = value.x, y = value.y, use = use, method = method): the
## standard deviation is zero
That happens if a geography only has a couple days of data, so the relationship between two signals is a perfect line.
You can see this happen in vignettes/correlation-utils.Rmd in the final code block. We simply suppressed warnings for that block; we should figure out a more elegant approach, such as having covidcast_cor catch the repeated warnings and issue only one warning that says this problem has occurred.
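The catch-and-summarize idea can be sketched generically (illustrative Python, not covidcast_cor's actual code; `correlate_with_summary` and `toy_cor` are made-up names): record each geography's warnings, then re-issue a single summary warning.

```python
import warnings

def correlate_with_summary(geos, compute):
    """Run compute(geo) per geography, swallowing per-geography warnings
    and issuing one summary warning at the end instead."""
    flagged = 0
    results = {}
    for g in geos:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            results[g] = compute(g)
        flagged += bool(caught)
    if flagged:
        warnings.warn(f"standard deviation was zero in {flagged} geographies; "
                      "their correlations are NA")
    return results

def toy_cor(g):
    # stand-in per-geography computation: warns when the data are degenerate
    if g != "a":
        warnings.warn("the standard deviation is zero")
        return float("nan")
    return 0.9
```

Calling correlate_with_summary(["a", "b", "c"], toy_cor) then emits one warning mentioning 2 geographies rather than two identical warnings.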
The R package has some convenience data included to look up counties, MSAs, etc., and includes polygons to map some of them. Even if the Python package doesn't include mapping features, we should find a good source of similar data.
If there's an existing Python package that includes the data and polygons we want, we could just include a couple examples showing how to use it. If there isn't, we can adapt R-packages/data-raw/make.R
to produce the data files in a format the Python package can use easily.
Have looked into this a little bit w/ the GeoPandas docs. Can't save to shapefile since it's a datetime, so would have to cast that col to a string. When I save to geojson, 5 out of 3233 geometries don't line up for some reason. When I plot them, they look malformed. Need to look more into this.
Now that we have the Census data included in the R package, it'd be convenient to add helper functions to look up geo IDs by name:
x %>%
filter(geo_value == fips_lookup("Allegheny")) %>%
plot(plot_type = "line")
(Example from Ryan)
We could do this for counties and MSAs pretty easily, and they'd be convenient for use with the geo_values argument. They should be vectorized, so one could do msa_lookup(c("Pittsburgh", "San Antonio", "Milwaukee")) and get something that can be passed to geo_values.
As of #29, the R package now has functions fips_to_name, name_to_fips, etc. for working with FIPS codes and CBSA IDs. See the documentation.
This would be very handy for the Python package, because it's pretty inconvenient to have to figure out the right code for a county or MSA to be able to query data for it. I suspect the Python versions can work very similarly to the R versions, using the Census shapefiles (or other Census data, such as the Census data frames produced for the R package) to get names and IDs.
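A minimal sketch of what the Python analogues could look like, backed by a tiny inline table (the real versions would ship the full Census-derived data; the function bodies and table here are purely illustrative):

```python
# Tiny stand-in for the Census-derived county table.
_COUNTIES = {"42003": "Allegheny County, PA",
             "06037": "Los Angeles County, CA"}

def fips_to_name(codes):
    """Vectorized FIPS -> county name lookup (None for unknown codes)."""
    return [_COUNTIES.get(c) for c in codes]

def name_to_fips(patterns):
    """Vectorized substring match on county names -> lists of FIPS codes,
    mirroring the R helpers' one-to-many behaviour."""
    return [[fips for fips, name in _COUNTIES.items()
             if p.lower() in name.lower()]
            for p in patterns]

print(name_to_fips(["allegheny"]))  # → [['42003']]
```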
The functions in evalcast's get_covidhub_predictions.R currently use rvest and xml2 to scrape data off of GitHub; they should instead use a proper API client such as gh. This means rewriting evalcast::get_forecast_dates and evalcast::get_covid_hub_forecaster_names.
EDIT: Apparently the zoltr R package is intended for this, so a starting point would be to look at that package.
When evalcast::get_covidhub_predictions retrieves a csv file from a forecaster for a given forecast date, the "as of" principle should apply, i.e., we should get the csv file that had been posted as of the forecast date. Nick Reich noted that this is not always the case, as sometimes forecasters replace their file later. I'm not sure how much this occurs, but this is certainly an important point for ensuring fairness.
In ggplot2's geom_point, size means radius rather than area, so plot_bubble generates bubbles with radius proportional to value instead of area proportional to value. This exaggerates the size of the largest bubbles.
Currently plot_bubble manually constructs a size scale. We should see if we can use scale_size_binned_area() or similar to achieve the same thing but with scaling by area.
Given a signal over multiple dates, return an animation (maybe using matplotlib animation)
Currently geo/plotting functions only support state and county. Add support for MSAs as well. Blocked by #38
Currently Puerto Rico is not plotted even if a signal exists for that region.
Consider the following:
import covidcast
from datetime import date
import matplotlib.pyplot as plt
data = covidcast.signal("fb-survey", "smoothed_hh_cmnty_cli",
date(2020, 9, 8), date(2020, 9, 8),
geo_type="state")
covidcast.plot_choropleth(data, figsize=(6, 4))
plt.title("% who know someone who is sick, Sept 8, 2020")
I get this:
The colorbar is simply much too small for this size. I have to increase the figsize to at least (12, 10) to fit the colorbar text without collisions. Is it possible to adjust the colorbar drawing so it uses the available space more effectively? We could have it be full-width, but then large maps would look goofy.
For both the R and Python package documentation, we need to be a bit more welcoming. A new user will want to know
The R package provides mapping; the Python package does not. Admittedly it's not common for Python packages to provide their own plotting the way it is for R packages.
We should look at Cartopy and other relevant Python packages and see what it'd take to make mapping easy in the Python package. That need not mean implementing maps ourselves -- we could, for example, discover that by rearranging our data in a certain way, it's easy to provide to Cartopy. Then we can just provide the right adapter and give some examples of usage in the documentation.
This probably depends on #13.
Come up with some way to test the plotting functions in test_plotting.py. Testing plots sucks since everything renders differently. Long thread on it in #27.
Realized this wasn't being mocked out, and so the unit tests are currently making network calls, which is undesirable.
A primary use case would be when a forecaster uses a static data frame (e.g., census data). get_predictions would be passed these additional arguments using .... Perhaps baseline_forecaster should be modified to demonstrate this functionality? Or at least its documentation could be.
plot.covidcast_signal has various extra options that are currently documented with "TODO explain all these advanced parameters". We should document them.
Right now the mypy configuration is fairly bare-bones. We should add some flags to enforce behaviour regarding untyped defs, and also to keep None defaults from implicitly becoming Optional.
Example of config file https://github.com/markkohdev/mypy-example/blob/master/mypy.ini
For example, suppose I want to do a regression where I predict cases using various other signals. I need to obtain the signals, join them together, maybe lag some variables, and produce a data frame with one row per location and time, containing all measurements at that location/time.
We should provide a function, name to be determined, that joins the signal data frames together and names the value columns after the signals. Then we should put a few examples in the package docs showing how we can use this to conduct a data analysis.
@chinandrew, any changes we need to make before we should release 0.0.10? Should we celebrate by calling it 0.1.0?
Given a list of predictions cards corresponding to the same forecasting task except with varying ahead and forecaster, makes a plot where one sees the actual response over a specified period and the forecast trajectories, faceted by location. Each forecaster's trajectories are a different color (showing the median). The 100(1-alpha)% prediction interval is shown with a gray band (geom_ribbon).
plot_trajectory(list_of_predictions_cards, first_day = "2020-07-01", last_day = "2020-10-01", alpha = 0.2)
In terms of the "actual response" referred to above, one would query from covidcast (should use evalcast's function download_signal). The calculation of "actual" depends on the incidence_period: when this is "epiweek", the sum of the response over the last 7 days would be good to show. However, when this is "day", just the raw response should be shown (this is to accommodate cumulative forecasting).
ajgreen93 and jsharpna both have code for making trajectory plots for the epiweek case... this would be a good starting point.
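For the epiweek case, the "actual" is just a trailing 7-day sum of the daily signal per location; a sketch in pandas (toy data, column names as in the covidcast frames):

```python
import pandas as pd

# Eight days of a constant daily signal for one location.
daily = pd.DataFrame({
    "geo_value":  ["ca"] * 8,
    "time_value": pd.date_range("2020-07-01", periods=8),
    "value":      [1.0] * 8,
})

# Trailing 7-day sum within each location; NaN until a full window exists.
daily["actual_epiweek"] = (daily.groupby("geo_value")["value"]
                                .transform(lambda s: s.rolling(7).sum()))
print(daily["actual_epiweek"].tolist())  # last two entries are 7.0
```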
The MSA is Michigan City - LaPorte, cbsa/geo_value = 33140, source = "indicator-combination", signal = "confirmed_7dav_incidence_prop"
Aug 1st value = 812.385598. It's been around 8 or 9 recently, so I'm guessing this should be 8.12385598, but would like to confirm.