
Comments (11)

ryantibs commented on August 18, 2024
  1. I want to make sure that when we compute the Facebook community signal (and likewise for all other CMU-run surveys, like YouTube), we compute the binary value as:

(anybody in your household is sick) OR (anybody in your community, outside of your household, is sick)

Whereas in our first construction of this signal, we were only relying on the second clause above.

  2. I want to iron out exactly what we're doing to compute the standard error of the Facebook community signal: is it along the lines of what we did with the Facebook "individual" signal (yes, it should be), or the way the Google signal was done (no, it shouldn't be)? (The Google signal standard errors were based on proper stratified sampling, which we don't have with Facebook, so the same ideas shouldn't be used here.)

  3. We don't need to combine the Facebook community signal with the Google signal. Honestly, to save us work, let's just drop that part. Google's survey will go to sleep very soon (this week), so this combination would only affect what we do historically.

  4. Comparisons to YouTube are still interesting, but also not a priority; they're just helpful for understanding and as a sanity check. And in the best case, we could combine YouTube and Facebook into one signal (not worth it yet, but if we can demonstrate utility, then I can ask YouTube to crank up the sample size).

from covidcast-indicators.

capnrefsmmat commented on August 18, 2024

For point 2, on the standard error of the community signal, I can confirm that the estimate and standard error are calculated with

    dplyr::mutate(val = (replied_yes + 0.5) / (n_responses + 1),
                  se = sqrt(val*(1-val)/n_responses))

This is just the Jeffreys approach for a binomial proportion, with nothing fancy for stratified sampling.
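As a sketch, the same point estimate and standard error can be reproduced outside R; here in Python, with hypothetical counts (the function name is illustrative, not from the pipeline):

```python
import math

def jeffreys_proportion(replied_yes: int, n_responses: int) -> tuple[float, float]:
    """Jeffreys-style shrunken binomial proportion and its standard error.

    Adds pseudo-counts of 0.5 "yes" and 0.5 "no" (the Jeffreys prior),
    matching the dplyr::mutate() call quoted above.
    """
    val = (replied_yes + 0.5) / (n_responses + 1)
    se = math.sqrt(val * (1 - val) / n_responses)
    return val, se

# Example: 30 "yes" answers out of 100 responses.
val, se = jeffreys_proportion(30, 100)
print(round(val, 4), round(se, 4))  # 0.302 0.0459
```

The pseudo-counts keep the estimate (and the standard error) away from exactly 0 or 1 when a ZIP/day cell has few responses.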


capnrefsmmat commented on August 18, 2024

It's worth thinking about the different quantities we can estimate from the Facebook survey. (This also applies to the YouTube survey.)

  1. Population fraction with CLI. The current raw_cli and smoothed_cli signals attempt to estimate this using the number of people in each household and the number of people with CLI in each household, treating households as samples from the population. This makes the estimator more complicated.
  2. Fraction of households with CLI. We don't estimate this, but we could, just by throwing away information (discretize the household questions into just yes/no instead of a number of people).
  3. Fraction of respondents who know someone outside their household with CLI. The raw_community and smoothed_community signals do this.
  4. Fraction of people who know someone with CLI. This is what the Google survey estimated, and what @ryantibs prefers above, and it would involve combining together the previous two estimates.
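To make the four estimands concrete, here is a toy Python sketch over hypothetical per-response records (field names are illustrative, not the survey's; option 1 is shown as a naive pooled ratio, not the more complicated production estimator):

```python
# Hypothetical per-response records: household size, number sick in the
# household, and whether the respondent knows someone outside the
# household with CLI.
responses = [
    {"hh_size": 3, "hh_sick": 1, "knows_outside": True},
    {"hh_size": 2, "hh_sick": 0, "knows_outside": False},
    {"hh_size": 4, "hh_sick": 0, "knows_outside": True},
    {"hh_size": 1, "hh_sick": 1, "knows_outside": False},
]
n = len(responses)

# 1. Population fraction with CLI: pool people across households.
opt1 = sum(r["hh_sick"] for r in responses) / sum(r["hh_size"] for r in responses)

# 2. Fraction of households with CLI: discretize the count to yes/no.
opt2 = sum(r["hh_sick"] > 0 for r in responses) / n

# 3. Fraction who know someone *outside* their household with CLI.
opt3 = sum(r["knows_outside"] for r in responses) / n

# 4. Fraction who know anyone with CLI: logical OR of household and community.
opt4 = sum((r["hh_sick"] > 0) or r["knows_outside"] for r in responses) / n

print(opt1, opt2, opt3, opt4)  # 0.2 0.5 0.5 0.75
```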

According to @krivard's previous experiments, option 3 correlates better with cases and deaths than option 4. Option 4 accords better with Google surveys, except the two surveys give very different estimates anyway.

My feeling is that individual correlation with cases and deaths isn't as important as the question, "Does this provide information not provided by other signals?" Two signals that correlate very well with cases, but provide the same information, are less useful than two signals that correlate less well but measure different things and contribute to prediction in different ways. (Say, for example, one reaches elderly people and one mainly young people, and so together they provide more information than one general survey would.)

To that end, I think reporting options 1 and 3 (as we already do) may make more sense than reporting options 1 and 4 (as is proposed), since option 3 is complementary to option 1 while option 4 overlaps with option 1.


ryantibs commented on August 18, 2024

Thanks for laying all this out, nicely explained. A few comments:

  • Actually we have a lot more info than just what you allude to above. For Q5 on our survey, the respondent actually reports the number of people in their community (outside their household) with symptoms. So we could even compute an empirical distribution of these answers, create buckets based on quantiles, and report the score in buckets. This would give a more fine-grained community signal. (Best would be to adjust for community size by estimating it geographically, but we're not currently doing that anyway.)

  • Agree that correlation with cases or deaths is of course not the best metric; it's just a sanity check. But I don't quite agree with your last statement. From the perspective of modeling, it's not easy to get an answer to 4 from answers to 1 and 3. An answer to 4 is a weird nonlinear function of 1 and 3 (for a single binary value, it's a logical OR, so the relationship between proportions is going to be weird). Probably the best thing to do is just to compute estimates for all of 1-4. They're all in principle interesting to someone. Give them informative names, put them in the API, and we can figure out what makes the most sense vis-à-vis modeling later.

  • One last thing to mention: we could try to figure out whether P(anyone sick | household size) varies with household size. Logan looked at this briefly when we were trying to pick an estimator, but nothing super thorough. Could be an interesting question to investigate thoroughly (not for forecasting, just generally) ...
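The logical-OR point above can be illustrated with two made-up populations that have identical marginal proportions for household CLI and community CLI, yet different OR proportions, so option 4 genuinely cannot be recovered from the aggregates of options 1-3:

```python
# Each tuple is (household has CLI, knows someone in community with CLI).
# Both populations have marginal rates of 0.5 and 0.5, but the overlap differs.
pop_a = [(True, True), (True, True), (False, False), (False, False)]
pop_b = [(True, False), (True, False), (False, True), (False, True)]

def rates(pop):
    """Return (household rate, community rate, either-one rate)."""
    n = len(pop)
    hh = sum(h for h, _ in pop) / n
    cm = sum(c for _, c in pop) / n
    either = sum(h or c for h, c in pop) / n
    return hh, cm, either

print(rates(pop_a))  # (0.5, 0.5, 0.5)
print(rates(pop_b))  # (0.5, 0.5, 1.0)
```

Because the OR rate depends on the per-respondent joint distribution, it has to be computed from the microdata, not derived from the published proportions.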


capnrefsmmat commented on August 18, 2024

What criterion would you use to determine which combination of options 1 through 4 to provide on the map? You're right that we can't easily reconstruct 4 from answers to 1 and 3, e.g. in our forecasting/nowcasting code, but I don't think we want to report all of 1 through 4 on the map. That's a lot of fractions to explain. Would you find my argument that 1 and 3 are complementary more persuasive for deciding that we should map those two?

I think, for example, if our map had options 1 and 2, we'd have a hard time explaining how they are different, even though they are clearly different. The details of sampling may elude most of the public.

My thinking here is that for v1.2, we can definitely map options 1 and 3, since we already have them and just need to double-check option 3. Then we'd have a map with DV, two surveys, GHT, and the combination.

Adding options 2 and 4 to the API is possible, but we'd have only a short time to examine them before Tuesday. And if we're going to add lots of additional outputs, it may make sense to wait until Taylor can refactor the Facebook code so it is easy to rapidly make new metrics, like 2, 4, and something like the first point you suggested above.


ryantibs commented on August 18, 2024

All good points. Just to be clear, I wasn't speaking about the map, just the signals in the API.

For v1.2, going with 1 and 3 makes sense, and agree they're complementary, so I'm fine with going forward with that. (Though just to say I don't know why computing 4 would be any harder than computing 3, it seems like it requires a very minor change.)

And waiting until the refactoring to hammer everything out also makes sense.


krivard commented on August 18, 2024

@capnrefsmmat re our conversation this morning: the community signal only makes sense for the subset of all survey data that comes from the version that includes the community question. We still get a couple dozen responses daily from the old survey, even though Facebook stopped providing that link to users on April 15.

The code that currently computes (3) is near the top of facebook-community/code/daily_csv_generate.R:

    # Aggregate community-question answers (A4) into yes/no/not-sure
    # counts per ZIP code and day.
    q5.by.zip = recoded.cleaned.partA %>>%
      dplyr::transmute(
        ZIP5=as.numeric(ZIP5),
        Day=Date,
        A4=as.numeric(A4),
        # SurveyID exists only on the survey version that asked the
        # community question, so a blank A4 there means "not sure"
        blank_is_not_sure=!is.na(SurveyID)) %>>%
      # keep rows that either saw the question or answered it
      dplyr::filter(blank_is_not_sure | !is.na(A4)) %>>%
      dplyr::group_by(ZIP5, Day) %>>%
      dplyr::summarise(
        replied_yes=sum(A4>0, na.rm=TRUE),
        replied_no=sum(A4==0, na.rm=TRUE),
        replied_not_sure=sum(is.na(A4) & blank_is_not_sure)
      ) %>>%
      {.}

and automatically assigns not_sure if the user was offered the community question but declined to answer it; I was mirroring the existing Google community survey at the time. It turns out that community CLI doesn't do anything with the not_sure data, so we can probably drop that logic.

SurveyID is a column that was added at the same time as the community survey question, and may be useful for filtering if we want to do a proper demo of (4) for the map.


krivard commented on August 18, 2024

For (4), we should decide how to complete the following:

| community \ household | number sick = NA | number sick = number |
| --- | --- | --- |
| number sick = NA | drop | ? |
| number sick = number | ? | sum |


krivard commented on August 18, 2024

Draft signal for (4) is up at wip_[smoothed|raw]_hhcmnty, which just drops the ? cells in that table.
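A minimal Python sketch of that combination rule (the production code is R; the function name is hypothetical): a response contributes only when both the household and community questions were answered, so the "?" cells and the NA/NA cell are all dropped.

```python
from typing import Optional

def combine(hh_sick: Optional[int], cmnty_sick: Optional[int]) -> Optional[int]:
    """Combine household and community sick counts per the table above.

    Returns None (drop) unless both questions were answered; otherwise
    returns the summed count, which feeds the "anyone sick" indicator.
    """
    if hh_sick is None or cmnty_sick is None:
        return None  # drop: covers the NA/NA cell and both '?' cells
    return hh_sick + cmnty_sick  # sum cell: both answered

print(combine(2, 3), combine(2, None))  # 5 None
```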

Indicators has found it good, and I've written it up in the Roni doc for naming.


krivard commented on August 18, 2024

Approved name for (3) is [raw|smoothed]_nohh_cmnty_cli

Approved name for (4) is [raw|smoothed]_hh_cmnty_cli

Both are ready to add to API docs for v1.2.


krivard commented on August 18, 2024

Successfully deployed in public API and live map.

