A known issue with the fb-survey is that if a user tagged for the survey forwards it to friends, their responses get tagged with the same identifier. The fb-generated weight, however, applies only to the response from the user originally selected for the survey.
Previously, we addressed this at the aggregation step by throwing out all but the earliest survey response for each identifier. This works, but it is slow, since it requires loading the cumulative list of all tokens and their earliest known start dates.
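The "keep only the earliest response per identifier" rule can be sketched as follows. This is a minimal illustration in Python, not the pipeline's actual code: the record layout and the field names `token` and `start` are assumptions made for the example.

```python
from datetime import datetime


def keep_earliest(responses):
    """Return one response per token: the one with the earliest start time.

    `responses` is an iterable of dicts with the hypothetical keys
    "token" (the survey identifier) and "start" (an ISO 8601 timestamp).
    """
    earliest = {}
    for response in responses:
        start = datetime.fromisoformat(response["start"])
        best = earliest.get(response["token"])
        # Keep this response if it is the first seen for the token,
        # or if it started strictly earlier than the current best.
        if best is None or start < best[0]:
            earliest[response["token"]] = (start, response)
    return [response for _, response in earliest.values()]


# Made-up example data: token "abc" appears twice because it was forwarded.
responses = [
    {"token": "abc", "start": "2020-04-06T09:00:00", "answer": 1},
    {"token": "abc", "start": "2020-04-08T12:30:00", "answer": 0},
    {"token": "xyz", "start": "2020-04-07T15:45:00", "answer": 1},
]
deduped = keep_earliest(responses)
```

Only the 2020-04-06 response survives for token "abc", so the weight is applied once. The cost noted above comes from the fact that `responses` has to cover the full history of tokens for the rule to be correct.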
While transitioning the system to pseudo-incremental processing (where we dump partially processed survey responses into a big bucket and store them for the next run, so that we only have to fully process roughly the last week's worth of data), I foolishly split off the step that generates the identifier list for a day in such a way that it is not antijoined against the cumulative list. This has left us with duplicate identifier-weight pairs for 33 surveys, going back to the very first week of the survey.
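For illustration, the antijoin that the per-day step skipped could look roughly like the sketch below. It is written in Python with hypothetical names (`new_tokens_for_day`, `update_cumulative`) and assumes that each list is a mapping from token to start date and that days are processed in chronological order, so any token already in the cumulative list has the earlier start date.

```python
from datetime import date


def new_tokens_for_day(day_tokens, cumulative):
    """Antijoin: keep only the day's tokens that are absent from `cumulative`."""
    return {tok: start for tok, start in day_tokens.items() if tok not in cumulative}


def update_cumulative(day_tokens, cumulative):
    """Return (tokens to emit for the day, updated cumulative list)."""
    fresh = new_tokens_for_day(day_tokens, cumulative)
    merged = dict(cumulative)  # copy so the caller's mapping is not mutated
    merged.update(fresh)       # only unseen tokens are added
    return fresh, merged


# Made-up example: "abc" was first seen on 2020-04-06 and shows up again
# on 2020-04-08 because the survey link was forwarded.
cumulative = {"abc": date(2020, 4, 6)}
day_tokens = {"abc": date(2020, 4, 8), "xyz": date(2020, 4, 8)}
fresh, cumulative_after = update_cumulative(day_tokens, cumulative)
```

Here `fresh` contains only "xyz", so no second identifier-weight pair is produced for "abc", and `cumulative_after` still records 2020-04-06 as the start date for "abc".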
[1] "raw_wcli"
[1] "hrr"
        date geo_id     val.x      se.x sample_size.x effective_sample_size.x
1 2020-04-06    155 1.3361589 0.8031495           181                178.0729
2 2020-04-08    113 0.4751491 0.1059180          2820               1758.6626
3 2020-04-09    145 0.7352730 0.3614189           439                356.8198
4 2020-04-09    223 0.7999764 0.5034265          2179               1445.3862
5 2020-04-30     56 0.3355428 0.1329136          1157                865.6997
      val.y      se.y sample_size.y effective_sample_size.y val.mismatch
1 1.5321662 0.8219147           182                178.9988         TRUE
2 0.4728166 0.1057039          2821               1703.0832        FALSE
3 0.7287457 0.3597578           440                349.8120        FALSE
4 0.7975445 0.5019053          2180               1435.6523        FALSE
5 0.3336672 0.1327289          1158                851.6769        FALSE
  se.mismatch sample_size.mismatch effective_sample_size.mismatch
1       FALSE                FALSE                          FALSE
2       FALSE                FALSE                           TRUE
3       FALSE                FALSE                           TRUE
4       FALSE                FALSE                           TRUE
5       FALSE                FALSE                           TRUE
[1] "raw_wcli"
[1] "msa"
        date geo_id     val.x      se.x sample_size.x effective_sample_size.x
1 2020-04-08  47900 0.6446462 0.1258031      3912.972                2407.896
2 2020-04-30  31080 0.3344158 0.1191181      1557.844                1170.382
      val.y      se.y sample_size.y effective_sample_size.y val.mismatch
1 0.6423696 0.1254606      3913.972                2354.284        FALSE
2 0.3330240 0.1188767      1558.844                1156.206        FALSE
  se.mismatch sample_size.mismatch effective_sample_size.mismatch
1       FALSE                FALSE                           TRUE
2       FALSE                FALSE                           TRUE
[1] "raw_wcli"
[1] "state"
        date geo_id     val.x       se.x sample_size.x effective_sample_size.x
1 2020-04-08     md 0.5978971 0.10563249      5852.989               3858.5343
2 2020-04-09     md 0.6106332 0.23702405      4831.990               3285.0041
3 2020-04-10     nh 0.5424426 0.25401259       652.000                467.4285
4 2020-04-30     ca 0.3513494 0.06685219      8207.078               6220.2927
      val.y      se.y sample_size.y effective_sample_size.y val.mismatch
1 0.5965956 0.1054311      5853.989               3805.2662        FALSE
2 0.6097974 0.2366978      4832.990               3274.1975        FALSE
3 0.5608735 0.2585968       653.000                489.9669        FALSE
4 0.3510655 0.0668012      8208.078               6205.3173        FALSE
  se.mismatch sample_size.mismatch effective_sample_size.mismatch
1       FALSE                FALSE                           TRUE
2       FALSE                FALSE                           TRUE
3       FALSE                FALSE                           TRUE
4       FALSE                FALSE                           TRUE
[1] "raw_wcli"
[1] "county"
        date geo_id     val.x      se.x sample_size.x effective_sample_size.x
1 2020-04-06  17031 0.8046518 0.2241702     1166.0908                665.7235
2 2020-04-08  24021 0.3922447 0.2390339      324.7628                344.3633
3 2020-04-30  06037 0.3346056 0.1328722     1151.7269                864.2107
      val.y      se.y sample_size.y effective_sample_size.y val.mismatch
1 0.8248768 0.2249927     1167.0908                666.2835         TRUE
2 0.3815037 0.2223784      325.7628                382.5789        FALSE
3 0.3327233 0.1326897     1152.7269                850.0887        FALSE
  se.mismatch sample_size.mismatch effective_sample_size.mismatch
1       FALSE                FALSE                          FALSE
2       FALSE                FALSE                           TRUE
3       FALSE                FALSE                           TRUE