cmu-delphi / covidcast-indicators Goto Github PK

View Code? Open in Web Editor NEW

12.0 12.0 17.0 219.11 MB

Back end for producing indicators and loading them into the COVIDcast API.

Home Page: https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html

License: MIT License

Python 78.07% R 13.75% Shell 0.51% Jupyter Notebook 4.94% Makefile 1.29% Jinja 1.32% Dockerfile 0.12%

covidcast-indicators's People

Contributors

Stargazers

Watchers

Forkers

sgsmob jedgrabman dfarrow0 benjaminysmith dshemetov davidkretch qx-teo krivard alexcoda manisci jingjtang deeeeksha leonardzh leonlu2 m5skid neul3 pauloscarin-dev

covidcast-indicators's Issues

Automate all public signals

Entails moving code to produce each into the covidcast-indicators repository. NB doctor-visits and fb-survey have restricted-access data sources, so for them we first need a way to test automation code on safe data.

Code ingestion:

Verify new code produces same results as old code:

Automation:

Add weighted version of Facebook community signals

Create inverse probability weighted versions of Facebook community signals. So, same as we have, just ending in wcli and wili. (Facebook themselves will be the main consumers of this signal.). Probably best as a job for Taylor, after he finishes the refactoring.

[doctor-visits] Use Jeffreys estimator for binomial proportion

The doctor visits pipeline should estimate the binomial proportion using the Jeffreys estimate, where

$$ \hat p = \frac{x + 0.5}{n + 1} $$

That's the posterior mean if you put a Beta(0.5, 0.5) prior on p; see here. This would be consistent with the Facebook community question and the Google survey pipeline.

When we report standard errors internally, these should be reported based on this estimate, to avoid ever reporting SE = 0.

We should use this for internal reports immediately (cc @huisaddison), but probably should not alter public estimates without a release note, unless you find the changes in estimates are very small compared to e.g. normal backfill.

Finalize combined signal `nmf_day_doc_fbs_ght`

Monitor nowcasting's progress and be ready to pick this up for an off-cycle release.

Load jhu-csse backfilled data

Currently, the jhu-csse pipeline derives its daily incidence and cumulative by reading from the confirmed_US and deaths_US sheets here:
https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

(Note that daily reports are also available, but they do not stretch very far in time for the United States).

I recently learned that JHU backfills its US data: https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/Errata.csv

The number of changes to the US data are sufficiently small that ignoring them probably would not be much worse than rounding error; but in principle we should probably incorporate them.

Audit geographical_scope and update to newest FIPS codes

As shown in the API docs, the JHU CSSE data has different FIPS coding than the rest of our data -- because it uses newer FIPS codes for several counties whose codes changed in 2015.

That implies that the rest of our data sources are actually reporting outdated FIPS codes. I think this is because geographical_scope was cobbled together from several sources. We should

expand the README to systematically document sources, FIPS code years, dates populations are for, etc.
update our FIPS reporting to the latest definitions (like JHU)
do sanity checks to make sure all the geography files are consistent with each other, population proportions are correct, etc.

Updating FIPS codes would require coordination with viz and release notes.

Consider new ways to calculate direction

There are several things to consider:

The deaths signal is sparse (many zeros) and high-noise (no smoothing), so direction can jump around a lot in individual counties. Should it smooth more? Is it meaningful to declare direction in these cases?
We set a threshold for what counts as increasing or decreasing on the other signals. Are we happy with this threshold?

Note that currently, direction is calculated for all signals server-side, so the ideal solution would not require special-casing specific signals.

[fb-survey] bug in w*li raw and smoothed

A known issue with the fb-survey is that if a user tagged for the survey forwards the survey to friends, their survey responses get tagged with the same identifier, but the only survey response to which the fb-generated weight applies is the user originally selected for the survey.

Previously, we have addressed this at the aggregation step by throwing out all but the earliest survey for each identifier. This works, but is slow, since it requires loading the cumulative list of all tokens and their earliest known start dates.

While transitioning the system to pseudo-incremental (where we dump partially-processed survey responses into a big bucket and store for the next run, so that we only have to fully process the last week's worth of data or so) I foolishly split off the step of generating the identifier list for a day in such a way that it does not get antijoined against the cumulative list. This has caused us to have duplicate identifier-weight pairs for 33 surveys going back to the very first week of the survey.

Recommended fix:

For past identifiers and weights, keep only the earliest identifier-weight pair in each set. This generates a bunch of edits that exceed our (arbitrary, but still) validation limits, details below. Most of them are large enough that I'd be more comfortable noting them in the release notes than silently passing them through.
For future identifier lists, anti join against the cumulative list.

Objections?

Diffs by geo type:

hrr: 5 raw (below), 35 smoothed
msa: 2 raw (below), 14 smoothed
state: 4 raw (below), 22 smoothed
county: 3 raw (below), 15 smoothed

[1] "raw_wcli"
[1] "hrr"
        date geo_id     val.x      se.x sample_size.x effective_sample_size.x
1 2020-04-06    155 1.3361589 0.8031495           181                178.0729
2 2020-04-08    113 0.4751491 0.1059180          2820               1758.6626
3 2020-04-09    145 0.7352730 0.3614189           439                356.8198
4 2020-04-09    223 0.7999764 0.5034265          2179               1445.3862
5 2020-04-30     56 0.3355428 0.1329136          1157                865.6997
      val.y      se.y sample_size.y effective_sample_size.y val.mismatch
1 1.5321662 0.8219147           182                178.9988         TRUE
2 0.4728166 0.1057039          2821               1703.0832        FALSE
3 0.7287457 0.3597578           440                349.8120        FALSE
4 0.7975445 0.5019053          2180               1435.6523        FALSE
5 0.3336672 0.1327289          1158                851.6769        FALSE
  se.mismatch sample_size.mismatch effective_sample_size.mismatch
1       FALSE                FALSE                          FALSE
2       FALSE                FALSE                           TRUE
3       FALSE                FALSE                           TRUE
4       FALSE                FALSE                           TRUE
5       FALSE                FALSE                           TRUE
[1] "raw_wcli"
[1] "msa"
        date geo_id     val.x      se.x sample_size.x effective_sample_size.x
1 2020-04-08  47900 0.6446462 0.1258031      3912.972                2407.896
2 2020-04-30  31080 0.3344158 0.1191181      1557.844                1170.382
      val.y      se.y sample_size.y effective_sample_size.y val.mismatch
1 0.6423696 0.1254606      3913.972                2354.284        FALSE
2 0.3330240 0.1188767      1558.844                1156.206        FALSE
  se.mismatch sample_size.mismatch effective_sample_size.mismatch
1       FALSE                FALSE                           TRUE
2       FALSE                FALSE                           TRUE
[1] "raw_wcli"
[1] "state"
        date geo_id     val.x       se.x sample_size.x effective_sample_size.x
1 2020-04-08     md 0.5978971 0.10563249      5852.989               3858.5343
2 2020-04-09     md 0.6106332 0.23702405      4831.990               3285.0041
3 2020-04-10     nh 0.5424426 0.25401259       652.000                467.4285
4 2020-04-30     ca 0.3513494 0.06685219      8207.078               6220.2927
      val.y      se.y sample_size.y effective_sample_size.y val.mismatch
1 0.5965956 0.1054311      5853.989               3805.2662        FALSE
2 0.6097974 0.2366978      4832.990               3274.1975        FALSE
3 0.5608735 0.2585968       653.000                489.9669        FALSE
4 0.3510655 0.0668012      8208.078               6205.3173        FALSE
  se.mismatch sample_size.mismatch effective_sample_size.mismatch
1       FALSE                FALSE                           TRUE
2       FALSE                FALSE                           TRUE
3       FALSE                FALSE                           TRUE
4       FALSE                FALSE                           TRUE
[1] "raw_wcli"
[1] "county"
        date geo_id     val.x      se.x sample_size.x effective_sample_size.x
1 2020-04-06  17031 0.8046518 0.2241702     1166.0908                665.7235
2 2020-04-08  24021 0.3922447 0.2390339      324.7628                344.3633
3 2020-04-30  06037 0.3346056 0.1328722     1151.7269                864.2107
      val.y      se.y sample_size.y effective_sample_size.y val.mismatch
1 0.8248768 0.2249927     1167.0908                666.2835         TRUE
2 0.3815037 0.2223784      325.7628                382.5789        FALSE
3 0.3327233 0.1326897     1152.7269                850.0887        FALSE
  se.mismatch sample_size.mismatch effective_sample_size.mismatch
1       FALSE                FALSE                          FALSE
2       FALSE                FALSE                           TRUE
3       FALSE                FALSE                           TRUE

[fb-survey] audit individual response files

Startdatetime is in UTC; should be pacific time
Something fishy with weights missingness
Verify that backfill stops after 4 days

Improve Google survey megacounty estimates to have less variance (update historical data)

Using a higher prior strength to reduce megacounty variance.

Enrich location tooltip

Make the location tooltip (the one that shows when hovering over any location) more informative.
Instead of saying: "Percentage: 1.3%", it should say something like "Percentage of Doctors Visits that are due to COVID-like symptoms: 1.3%".
This is what FB, ACHD and others currently do. It is more friendly to newcomers.
Right now this string is taken from the 'Y-Axis' variable in the strings document. It should have its own string.

Add mobility indicators

We should add mobility indicators. I think we were implicitly hesitant here because the causal arrow is completely backwards: it’s a reflection of policy, which itself a reflection of the interpretation of current COVID activity. So we may want to create a new category, just like we have “Official Reports”. Maybe “Mobility Reports” and list estimates both Google and Safegraph? The main point is that it'd just be helpful to have these on the map, to visually compare against the other indicators.

Ingest Quidel COVID test data

Antigen test results will start coming to us Monday. Not as high-quality as PCR, but much faster, and still good enough for clinical use.

Visualize count data

We have a great demo from Addison; the remaining work on this ticket is for the viz team.

include Puerto Rico in map

Expand API documentation further

It needs to cover

megacounties and how they are FIPS coded
basics of sample size rules (when counties are missing)
the new combined sensor
Google surveys are not being updated

Design and automate a signal dashboard

We need a way to tell if something goes wrong with our signals. That could mean:

Sample size of a signal goes down unexpectedly
Signal is available in fewer geographic areas than normal
Standard errors unexpectedly change

Katie has someone who can check the COVIDcast site each day to ensure it looks right, but the map doesn't reveal everything, so a behind-the-scenes dashboard would be useful. Maybe a simple Rmd report that gets generated each day?

Improve backfill model for DV signal

Maria and Jacob have been working on a better backfill model. Maria's implemented it, as far as I understand, and it looks quite promising. Just needs to go trough more rigorous checking, which Aaron can help with.

[ght] Possible backfill on the API side

Some points from the indicators meeting discussion today:

We definitely know of the possibility that the most recent day of data may be incomplete
It may be possible that in addition to non-deterministic return values from the API, the API may "backfill" data up to 3 days, depending on what server logs are available to be queried (@brookslogan correct me if I have misstated this)

Possible solutions:

Subject data to "backfill" for up to k days, i.e., when we pull new data, drop the final k rows in the pre-existing cache and append newly queried rows to them.

If only the most recent day may be incomplete, then we only need to take k=1, but if the GHT API is indeed subject to a longer period of backfill, then we need to take k larger.

Ultimately, it would be most useful if we could obtain some additional references / documentation for GHT, otherwise we could just set a rule where we backfill k=5 and hope for the best... @krivard @capnrefsmmat

Ingest and archive CDC hospitalization data

From @ryantibs:

We should start ingesting the CDC NSHN Hospitalization Data, and the CDC COVID-NET hospitalization data. These appear to be two alternative CDC run surveillance system for state-level COVID hospitalizations. We should compare these against each other---in whatever ways are possible---and also to state level COVID hospitalization data collected from the state depts of health, through COVID Tracking.

From @RoniRos:

It's important to start this ASAP, because there is backfill (aka data revisioning), and no guarantee that anyone is storing the historical versions, which are critical for real-time forecasting.

Until we download and compare these, it won't be clear what exactly should go into the API, so we should

download the data regularly, archiving historical versions so we understand the backfill
compare the data from the different sources
determine which sources we want to use
determine if these should be in the API
put them in the API

This issue will just be for the first three steps; once we know what should go into the API and how, the Indicators team can plan what release this should go into.

Improve smoothed Facebook signal in locations that only occasionally meet sample size thresholds

If a county only has one observation over two weeks, the raw signal will have a spike in it; the smoothed signal will spike and then drop later, since it takes the last 7 days of data.

If a county consistently has observations, but most days they're under the sample size threshold, the raw signal will report NAs on most days and signals on others. That's fine, but the smoothing uses the past 7 days, and will smooth over the NAs and possibly create strange visual artifacts.

Is there a better smoothing or filtering method to avoid this?

Document Quidel's suspension in API docs

Prefer to have #3 done before enacting this.

Other decisions to make:

What should happen to the quidel data already in the epidata DB? Transfer to private (#6)?

Store a forecast view of covidcast epidata signals

Essentially, this means extending epidata.covidcast to also store an issue date.

Some questions:

When/where/how should issue be updated -- whenever there is any new data for a signal, or only for the dates receiving updated figures?
How will clients query against issue? With an exact match, or a range?

The following code will need to be adjusted:

covidcast acquisition
covidcast direction calculations
epidata clients

We will also need to:

migrate existing data, possibly requiring a brief map outage

Consider producing epiweek rollup estimates for the surveys

For example: the Facebook survey includes many questions beyond the two we currently report. Some of these may not be useful to map every day, but they could be useful to researchers and journalists studying the effects of COVID-19. For example, there are questions about how many people the respondent has been in close contact with, and questions about their current mental health.

Producing aggregations for full epiweeks may let us report these quantities with larger sample sizes, and hence report for more counties.

[fb-survey] synthetic dataset and unit tests

spec out what kind of data is needed
generate data or adapt existing data to be delphi-safe
complete tests

Examine Facebook survey for possible new sensors

This paper says loss of smell or taste are excellent predictors; we should make a sensor out of the data we're getting from the facebook survey on that (B2==13)

[jhu-csse] Add smoothed cases/deaths signals TBD to show in the map

From Ryan:

Seems like with cases and deaths, we want to aggregate and display these over the course of a whole week.

Aligned with epiweek would be the most standard way to do it, and this would display a much more stable signal.

Weekly signals aligned with epiweek are already supported in the API (thanks David).

But would require a significant change on the viz side (forcing the time slider/time series to progress in terms of weeks)?

We could hack a solution by passing it a daily signal with the same estimate for 7 days in a row.

Another possibility is to display trailing 7-day averages, similar to the other signals.

Motivation here would be increased stability of time series and associated direction calls (ref #1), but we'd have to think carefully about who our primary audience is ("some combination of PH officials, other government officials, HC professionals, and the general public" but not sure which group has priority [Roni]) and what they expect from a cases/deaths signal. Are we presenting the JHU data "as a 'benchmark' against which we suggest users compare our indicators" -- in which case a 7-day rolling average makes the most sense, since that's what the other indicators use -- or "as a source of 'information'" -- in which case we should go for aggregation by epiweek [Addison].

One way that's easy to understand is to report the current "death rate", which can be some locally smoothed estimate of the underlying rate of deaths per day. [Alex]

Roni has asked Viz to find ways to support user selection of all combinations of {incidence,cumulative} X {counts,ratios} X {Cases, Deaths}, so it may turn out to be straightforward to add epiweek aggregations as another view of the data. This would let us provide a visual interface to the raw JHU data as well as a more public-focused construction of smoothed ratios or rates.

[fb-survey] se for weighted metrics is still NA when value=0

Was this supposed to work already, or is this a new task?

The original reason for the change in se calculations was to fix an artifact in the covidcast time series display. Since weighted figures are not used in this display, I'm okay with this being low-priority, keeping in mind facebook will probably appreciate this fix when we get around to it.

Get release log posted on the COVIDcast website

Currently there is a draft in the COVIDcast Materials folder on Google Drive. We need to finish this up and get marketing to put it in the CMS. Roni has edit access to the CMS, so we just need to get a final draft to him.

Groups of counties shapefiles

One of the problems in #31 is the fact that JHU reports information for all five boroughs of NYC together in the Manhattan FIPS code (36061). The "correct" way to do this is to create a new polygon which is the union of all five boroughs.

If we are planning to later visualize signals for groups of counties (as described here), we will need a general solution for polygons of groups of counties. @ryantibs @RoniRos are we planning on visualizing whatever comes out of this "groups of counties" data?

Based on feedback from @statsmaths and @capnrefsmmat, it sounds like it is trivial to create new polygons that are the unions of other polygons in a shapefile (multi-polygons?). Something I didn't clarify earlier: if I take the union of two polygons which share a border, will the shared border (which is now on the interior) disappear? That would be necessary for us to create aesthetically pleasing visualizations.

[jhu-csse] Visualization errors

Visualizations are not properly handling the documented FIPS Exceptions.

For example, all of NYC's counts are appearing as only in Manhattan Borough, with zero cases / deaths in the other four boroughs.

Also, two counties are (one in South Dakota, one in Alaska) are missing data because of the mismatched FIPS codes, also documented.

[update below by krivard]

Solution: Maintain the current API signal as "essentially unaltered from JHU." Create a new signal for viz with the following changes:

Have the other four NYC boroughs "mirror" the reporting of Manhattan.
Replace the JHU FIPS codes for the two counties in AK and SD with our own.
Update viz to use new signal.

Produce standard errors for combined signal

Raised at the 12 May meeting.

The current combined signal does not show standard errors. We have standard errors from most of our original signals, and can also measure reconstruction error in our combination system. Can we use this to estimate standard errors or some other measure of uncertainty for the combined signal?

Forecast visualization prototypes

Coordinate with forecasts team and viz team to prototype visualizations for forecasts before Alex C leaves.

Add Facebook community question to the combined signal

Since it's now on the public map, we should update the combined signal to include it.

sanity check the new signal
get the new signal into the API and document it
update the map JSON
release the new map

Consider redefining COVID-like illness and updating Qualtrics surveys to match

CDC guidance on common COVID symptoms has changed; @ryantibs says fever is no longer considered important, whereas fatigue and loss of smell/taste are.

We should

find an authoritative data source on what constitutes COVID-like illness, based on current understanding
decide if we want to change "CLI" in the Facebook survey, since we ask about most symptoms on page 2 even if we don't on page 1
decide whether to change all Qualtrics surveys (including Google and YouTube) to feature the new COVID symptoms first, on page 1

cc @RoniRos

Remove Google Surveys from the map

Figure out how to use Google surveys on small scale

We're not going to be running the Google surveys daily past Friday, May 15. But we are still able to run the surveys on a small scale (i.e. smaller budget) if we want; we have complete control over how the survey is geographically targeted, so we can pick a strategy.

We could use it to augment the Facebook surveys, but the signals do not see comparable; see issue #2. Is there another, better use for these surveys that would improve forecasting or nowcasting?

Consider a tool to compare our maps to others, or compare our maps to each other

Rob thinks it'd be very interesting to be able to compare our maps to other maps released by other groups, e.g. to see where our survey results disagree with activity maps released by others. It could also be interesting to provide a visual way to compare within our maps, to see what different locations they highlight.

To compare with other maps, we'd need some way to ingest their underlying data.

My view: If we aim to build an exploratory viz tool, i.e. one that experts can use to explore data and try different scenarios and so on, then a map or indicator comparator would be useful. But if we're building the viz tool for the general public, they simply want to know how bad the situation is in their area; they're not concerned by the difference between different indicator, because they want just one indicator with the answer.

Evaluate the Facebook community survey signal and finalize its definition

Open questions:

Why does it vary so much from the Google survey signal? Can we combine it with the Google signal, for example by rescaling?
Can we compare to the YouTube signal as well? One hypothesis is that the Facebook signal is higher because people volunteer to complete the survey because they're going to answer "yes"; if so, this would also apply to the YouTube community signal.

Publish a document with technical details on the indicators

The current draft is available in covid-19/notes/signal_descriptions.tex.

@ryantibs took the lead on drafting this, but asked for implementors to help fill in their sections. The best way is probably to make a branch, fill in your section, and make a pull request for Ryan to review, edit, and merge.

To be done, ideally by 1.3 release time:

Doctor's visits (@rumackaaron or @mariajahja)
Facebook surveys; see #20 (comment) for undocumented parts (@brookslogan)
Google survey state and megacounties (@lmackey, @nloliveira, or @robohyun66?)
Google search trends (@huisaddison)
Quidel test data (@jingjtang)
Combined indicator (@lmackey and team)

Let us know if you can't draft the materials in time or need to delegate to someone else.

Once these are drafted, tasks for me:

~~render and copyedit final PDF~~
~~put PDF in the staging site~~
[x ] ~~link to the PDF in the methodology text; probably mention in release notes just to highlight it~~
release publicly

Work out an indicator update schedule

We should try to update our indicators at a consistent time each day, so the combination signal can have all the latest indicators available at the time it runs.

This requires knowing when the data becomes available and when it's normally run. It also depends on our data sources whether this is possible -- data emailed by a third party won't always follow a schedule.

Could the current sensor implementors reply below with the following information? For concreteness, suppose we want to post data for June 1.

When does data for June 1 become available for download or posting? (e.g. if your indicator is lagged several days, specify which day you expect it to be available, and a typical range if there's some variation)
What time of day do you typically run your code to process that data?

Then we can determine

when we could schedule the cron job to run the automated indicators automatically
when we can schedule the combined signal to run

Tooltip text for counts should report both counts and ratios

Tooltip for JHU's Cases & Death should be made more informative, by providing both the counts and the per-100,000 ratio. E.g. it should say something like:
Number of Cases: 842
Population: 42,354
Case ratio: 1988 cases per 100,000 people

See for example the facebook map or the ACHD dashboard.

The facebook map even shows additional information, which would be nice, too.

This will require some code changes, because currently viz operates under a "one tab, one signal" rule.

[jhu-csse] Output numbers for Puerto Rico

The JHU data does include Puerto Rico, but we don't currently load it into the API. With #26 fixed, we can plot this on the map, so we should update the pipeline to do this.

@statsmaths or @huisaddison, I guess this is for either of you? If it's easy it would be nice to have this for 1.3, just so PR isn't completely empty.

[doctor-visits] synthetic dataset and unit tests

Spoke with Maria to spec out what is needed. We'll be getting a Delphi-safe, scrambled raw file for a single age group starting January/February to keep the file size down to something manageable (the actual files are Quite Large). Keep an eye out for a PR.

synthetic data generated
tests complete

Implement seasonal adjustment for Quidel flu tests

Will be useful for the flu testing signal when its sample size recovers.

Launch private epidata API

@brookslogan @korlaxxalrok

This API has two use cases:

Help teams develop new signals for the map without having to publish them in the public API
Provide Delphi members access to datasets protected by a data use agreement

The first case should be rushed out for a coordinated effort with viz while Alex C is still here.

We can and should be more careful with the second case.

Investigate YouTube survey data

Why is the survey data from YouTube users so different from the data from Facebook users? Is this difference suspicious (the data are bad) or useful (the population is different; we're capturing previously-ignored components of ground truth)?

Compare distribution of responses for each survey question
Compare spatial distribution at each geo level (do we get more youtube responses where facebook responses are sparse?)

If it is good, we can consider combining this signal with the facebook one.

Automate inactive signals

google-survey
quidel

Add hospitalizations signal from our healthcare partners

We should add hospitalization estimates from our healthcare datasource, analogous to our DV signals. Based on what I’ve been talking about Roni and others, I think it’d be dangerous to serve projected counts in our API and on the map (because of the possible huge discrepancies between this and the surveillance data on hospitalizations). But we could serve ratios? Fraction of hospitalizations due to covid-like illness. (Just like fraction of DV’s due to COVID-like illness.) We need to name it carefully, to ensure that it’s not viewed as an “Official Report” like hospitalizations coming from public health surveillance (state departments of health).

[jhu-csse] Add signal `cumulative_prop`

Complete the combination chart for both deaths and confirmed

\	cumulative	incidence
counts	✔️	✔️
prop	❓	✔️