Comments (12)
Nice work identifying that example! Looping in @neul3 as he's growing more familiar with debugging these kinds of issues using the DB.
The database records show at least one additional problem: a misfiled `is_latest_issue` flag:
```
MariaDB [epidata]> select value, is_latest_issue, issue, from_unixtime(value_updated_timestamp) from covidcast where source="jhu-csse" and `signal`="confirmed_incidence_num" and geo_type="state" and geo_value="ca" and time_type="day" and time_value=20200403 and issue>20210000;
+-------+-----------------+----------+----------------------------------------+
| value | is_latest_issue | issue    | from_unixtime(value_updated_timestamp) |
+-------+-----------------+----------+----------------------------------------+
|  1227 |               0 | 20210126 | 2021-10-18 11:29:04                    |
|  1227 |               0 | 20210401 | 2021-10-22 17:44:36                    |
|  1235 |               0 | 20210414 | 2021-10-22 17:44:36                    |
|  1293 |               0 | 20210916 | 2021-09-16 14:26:24                    |
|  1293 |               0 | 20210917 | 2021-10-22 17:44:36                    |
|  1295 |               1 | 20210927 | 2021-10-22 17:44:36                    |
|  1324 |               0 | 20211030 | 2021-10-30 06:04:45                    |
+-------+-----------------+----------+----------------------------------------+
7 rows in set (0.00 sec)
```
What seems to have happened is:
- The value was updated on Sept 16 by the daily acquisition pipeline.
- The value was updated on Oct 22 as part of a data patch including issues from April and September. This patch may have added new issues or updated old ones; there's no way to tell. It did, however, claim `is_latest_issue` for this geo-temporal point at issue 20210927.
- The value was updated on Oct 30 by the daily acquisition pipeline. It should have claimed `is_latest_issue` but, for whatever reason, did not.
The immediate solution in v3 would be to find a quiet afternoon and run an exhaustive `is_latest_issue` repair, because if this happened once, it has probably happened multiple times. With the v4 rollout, there was half a chance the data migration would have fixed this particular fault automatically when splitting data between the latest and history tables, but alas, it looks like the migration routine treated the `is_latest_issue` flags as authoritative as a time-saving measure. We'll need to adapt the old v3 is-latest repair routines for v4.
The upside is that the cache didn't lose sync, at least not in this instance. Is it possible to adapt your evaluation script to double-check that what's being reported as the latest issue doesn't have a sneaky more-recent sibling? Something like

```python
covidcast.signal(
    "jhu-csse", "confirmed_incidence_num",
    start_day=date(2020, 4, 3), end_day=date(2020, 4, 3),
    geo_type="state", geo_values="ca",
    issues=[date(2021, 9, 27), date.today()],
)
```

might do it.
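The core of that check can be sketched in pure pandas (toy data mirroring the table above; a real script would build this frame from API output):

```python
import pandas as pd

# Toy stand-in for the issues of one geo-temporal point, mirroring the
# ca / 2020-04-03 table above.
rows = pd.DataFrame({
    "issue": [20210916, 20210917, 20210927, 20211030],
    "value": [1293.0, 1293.0, 1295.0, 1324.0],
})

reported_latest = 20210927  # the issue the API currently flags as latest

# The reported latest issue has a "sneaky sibling" iff a larger issue exists.
max_issue = int(rows["issue"].max())
has_sneaky_sibling = max_issue > reported_latest

print(has_sneaky_sibling, max_issue)  # True 20211030
```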
from covidcast-indicators.
Patch is in! Spot check:
```
mysql> select value, issue, from_unixtime(value_updated_timestamp) from epimetric_full_v where source="jhu-csse" and `signal`="confirmed_incidence_num" and geo_type="state" and geo_value="ca" and time_type="day" and time_value=20200403 and issue>20210000;
+-------+----------+----------------------------------------+
| value | issue    | from_unixtime(value_updated_timestamp) |
+-------+----------+----------------------------------------+
|  1227 | 20210126 | 2021-10-18 15:29:04                    |
|  1227 | 20210401 | 2021-10-22 21:44:36                    |
|  1235 | 20210414 | 2021-10-22 21:44:36                    |
|  1293 | 20210916 | 2021-09-16 18:26:24                    |
|  1293 | 20210917 | 2021-10-22 21:44:36                    |
|  1295 | 20210927 | 2021-10-22 21:44:36                    |
|  1324 | 20211030 | 2021-10-30 10:04:45                    |
+-------+----------+----------------------------------------+
7 rows in set (0.27 sec)
```
```
mysql> select value, issue, from_unixtime(value_updated_timestamp) from epimetric_latest_v where source="jhu-csse" and `signal`="confirmed_incidence_num" and geo_type="state" and geo_value="ca" and time_type="day" and time_value=20200403 and issue>20210000;
+-------+----------+----------------------------------------+
| value | issue    | from_unixtime(value_updated_timestamp) |
+-------+----------+----------------------------------------+
|  1324 | 20211030 | 2021-10-30 10:04:45                    |
+-------+----------+----------------------------------------+
1 row in set (0.00 sec)
```
@dshemetov can you see if there are any remaining bad values in jhu-csse?
I have a batch of other possible errors in the incidence signal. This comes from the comparison between the live code and JIT: the live code `value` column is on the left (`self`), while the JIT value is on the right (`other`). You can see that the -3.0 difference on 20200630 is another live-code error like the one in the queries above.
```
                     value
                      self   other
geo_value time_value
02100     20200624     0.0     2.0
          20200630     0.0    -3.0
          20200715     0.0     5.0
          20200827     0.0    -3.0
          20200828     0.0     3.0
          20201007     0.0    -4.0
          20201009     0.0     4.0
          20201012     0.0    -4.0
          20201017     0.0     5.0
06003     20201110     2.0     0.0
          20210105     3.0     0.0
06017     20200511     2.0     1.0
          20200515     5.0     2.0
          20200522     0.0     7.0
          20200523     7.0     0.0
          20200524     0.0     2.0
          20200525     4.0     2.0
          20200526     0.0     1.0
          20200528     4.0     0.0
          20200529     0.0     3.0
          20200530     6.0     1.0
          20200531     0.0     2.0
          20200601     6.0     0.0
          20200603     0.0     2.0
          20200604     2.0    11.0
          20200605     4.0     0.0
          20200606     4.0     3.0
          20200609     2.0     1.0
          20200610     0.0     1.0
          20200612     2.0     1.0
          20200617     3.0     1.0
          20200618     4.0     5.0
          20200619     5.0     3.0
          20200620     0.0     6.0
          20200621     0.0     2.0
          20200622    15.0     5.0
          20200701    20.0     8.0
          20200702     8.0    10.0
          20200703     0.0     8.0
          20200704    20.0    10.0
          20200705     0.0    16.0
          20200706     0.0     5.0
          20200708    16.0     4.0
          20200709    12.0     7.0
          20200711     0.0    17.0
          20200712     0.0     7.0
          20200713     0.0     8.0
          20200714    43.0    26.0
          20200827     6.0     4.0
          20201007     9.0     8.0
          20201008    13.0    11.0
          20201012     2.0    12.0
          20201013    17.0     8.0
          20201014     3.0     2.0
          20201016     7.0     8.0
          20201110    35.0    36.0
          20201111    21.0    13.0
06027     20200508     0.0    -1.0
          20200509     0.0     1.0
          20200511     0.0    -1.0
          20200515     0.0     1.0
          20200516     0.0    -1.0
          20200527     0.0     1.0
          20200528     0.0    -1.0
          20200624     2.0     1.0
          20200701     1.0     0.0
          20200702     0.0    -1.0
          20200703     1.0     3.0
          20200708     0.0     3.0
          20200712     1.0     0.0
          20200714     2.0     0.0
          20200827     4.0     0.0
          20201007     2.0     1.0
          20201013     3.0     6.0
          20201014     4.0     2.0
          20201016     1.0     2.0
          20201110     4.0     0.0
          20210108     0.0    26.0
ak        20200506     1.0     2.0
          20200508     3.0    -3.0
          20200509     2.0     8.0
          20200511     0.0    -7.0
          20200515     2.0    15.0
          20200516     0.0   -10.0
          20200527     1.0    21.0
          20200528    13.0    -7.0
          20200624    25.0   180.0
          20200630    36.0  -159.0
          20200715    55.0   394.0
          20200731   111.0   109.0
          20200817    50.0    51.0
          20201014   142.0  -883.0
          20201015   172.0  1221.0
          20201110   523.0   514.0
          20201112   485.0   492.0
          20210105   193.0   174.0
          20210106   338.0   360.0
          20210108   399.0   372.0
          20210109   320.0   341.0
ca        20200220     0.0     2.0
          20200403  1295.0  1324.0
          20200404   795.0   746.0
          20200406  1549.0  1510.0
          20200407  1125.0  1097.0
          20200409   938.0   960.0
          20200410  1249.0  1216.0
          20200412   768.0   743.0
          20200415  1813.0  1812.0
          20200416  1229.0  1222.0
          20200421  2139.0  2191.0
          20200428  1440.0  1515.0
          20200501  1819.0  1834.0
          20200503  1498.0  1367.0
          20200518  1291.0  1446.0
          20200523  2287.0  2067.0
          20200531  2056.0  2333.0
          20200607  2022.0  2535.0
          20200614  2282.0  2550.0
          20200830  4118.0  4136.0
          20210703   653.0   679.0
```
I computed the incidence signal a third way on this same dataset, using pandas `groupby` and `diff`, and found that the JIT code had no differences, while the live code had the same list as above.
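For reference, the groupby-and-diff computation is roughly the following (a minimal sketch on toy cumulative data; the real dataset has more columns):

```python
import pandas as pd

# Toy cumulative counts for two geos; in the real check these come from
# the confirmed cumulative signal.
df = pd.DataFrame({
    "geo_value":  ["ak", "ak", "ak", "ca", "ca", "ca"],
    "time_value": [20200401, 20200402, 20200403] * 2,
    "cumulative": [10.0, 12.0, 15.0, 100.0, 100.0, 130.0],
})

# Incidence = day-over-day difference of the cumulative count,
# computed independently within each geo.
df = df.sort_values(["geo_value", "time_value"])
df["incidence"] = df.groupby("geo_value")["cumulative"].diff()

print(df["incidence"].tolist())  # [nan, 2.0, 3.0, nan, 0.0, 30.0]
```

The first day of each geo has no previous cumulative value, so its incidence is NaN rather than a spurious copy of the cumulative count.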
Ok, now this is very strange:
- I ran the indicator and generated output CSVs
- I pulled the data from the CSV files into a dataframe
- I manually diffed the cumulative signal and compared it to the incidence signal and they match up!
So the API has different data from what the indicator output CSV files contain.
So something might be wrong with the archiver??? 👀 Either:
- the archiver is not producing correct diffs, or
- the archiver produced correct diffs, but they never got uploaded to the DB, so the cache is out of sync with the DB
I used the county 02100 data above and tested the archiver on the incidence data (created a before file with `02100 20200624 0.0` and an after file with `02100 20200624 2.0` to make sure the diff is detected). The diff is correctly found; see #1687.
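The shape of that test, roughly (a standalone sketch, not the actual archiver code; the real differ in covidcast-indicators also handles added/removed rows, NaNs, and full CSV schemas):

```python
import pandas as pd

# Hypothetical before/after exports for county 02100 on 20200624.
before = pd.DataFrame({"geo_id": ["02100"], "val": [0.0]})
after  = pd.DataFrame({"geo_id": ["02100"], "val": [2.0]})

# Join on geo_id and keep rows whose value changed -- these are the
# rows a diff-based archiver should detect and upload.
merged = before.merge(after, on="geo_id", suffixes=("_before", "_after"))
changed = merged[merged["val_before"] != merged["val_after"]]

print(len(changed))  # 1
```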
Alright, so at least one instance of cache desync confirmed:
```python
# API value: 1295.0
import covidcast
from datetime import date

covidcast.signal("jhu-csse", "confirmed_incidence_num",
                 start_day=date(2020, 4, 3), end_day=date(2020, 4, 3),
                 geo_type="state", geo_values="ca")
# 0  ca  confirmed_incidence_num  2020-04-03  2021-09-27  542  0  0  0  1295.0  None  None  state  jhu-csse

# Cache value: 1324.0
import boto3
import pandas as pd
from io import StringIO

s3 = boto3.Session(profile_name='delphi').resource('s3')
obj = s3.Object('delphi-covidcast-indicator-output', 'jhu/20200403_state_confirmed_incidence_num.csv')
csv_contents = obj.get()['Body'].read().decode('utf-8')
pd.read_csv(StringIO(csv_contents)).query("geo_id == 'ca'")
# 5  ca  1324.0  NaN  NaN

# Local indicator output value: 1324.0
pd.read_csv("receiving/20200403_state_confirmed_incidence_num.csv").query("geo_id == 'ca'")
# 5  ca  1324.0  NaN  NaN
```
So the cache thinks the DB already has the latest value in its latest issue, whereas it actually doesn't, so no update occurs to the new value.
I guess I'll download the full S3 cache, download the full API dataset, and see how many values are in this situation.
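Counting values in this situation should be a straightforward merge between the cached CSVs and the API data (a sketch with toy frames; the `value_cache`/`value_api` column names are hypothetical):

```python
import pandas as pd

# Toy stand-ins: what the S3 cache says vs. what the API serves for
# the same (geo, time) keys.
cache = pd.DataFrame({"geo_id": ["ca", "ak"],
                      "time_value": [20200403, 20200403],
                      "value_cache": [1324.0, 25.0]})
api = pd.DataFrame({"geo_id": ["ca", "ak"],
                    "time_value": [20200403, 20200403],
                    "value_api": [1295.0, 25.0]})

# Rows where the cache and the API disagree are candidate desyncs.
merged = cache.merge(api, on=["geo_id", "time_value"], how="inner")
desynced = merged[merged["value_cache"] != merged["value_api"]]

print(len(desynced))  # 1 (the ca row)
```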
Good find on the `is_latest_issue` problem! And thanks for that code (didn't know you could pass a range to `issues`)!
A summary of work done today with @nolangormley:
- we started to measure the scope of the rows with incorrect `is_latest_issue` flags
- out of ~500k `is_latest_issue=1` rows with `signal=confirmed_incidence_num` and `geo_type=state`, we had ~5k rows where there was a more recent issue available
- these rows all had `time_value` prior to 20211020, suggesting that this is due to leftover bugs from the patch that went out at that time (and not a persisting problem with indicators or another part of the pipeline)

Next steps:
- get a count of the total number of rows with incorrect `is_latest_issue` and get a breakdown of these by (signal, geo)
  - need an optimized DB query for this, since it will potentially search all our indicators; here is a starting point (h/t @krivard):

    ```sql
    SELECT * FROM covidcast c
    JOIN (SELECT source, `signal`, geo_type, geo_value, time_type, time_value, issue latest_issue
          FROM covidcast t WHERE is_latest_issue=1) x
    USING (source, `signal`, geo_type, geo_value, time_type, time_value)
    WHERE issue > latest_issue
    ```
- build a table of all the rows that need to be repaired; we can use this table to create a patch DB update statement; the rows need the `(source, signal, geo_type, geo_value, time_type, time_value, issue, is_latest_issue)` fields
- prepare a repair announcement for the mailing list
- schedule a time to run this fix
- estimate its impact on the DB (can do this after the table of rows to be repaired is constructed)
- consider writing a crawler utility to periodically sanity-check our whole database for computed fields like this during downtimes

Next steps for JIT:
- partially blocked on this getting fixed
- get A/B tests ready for when this is fixed
- when this is fixed, if there are discrepancies, consider checking for an archiver cache desync (requires downloading a lot of data from S3; hopefully we don't have to do this)
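Conceptually, the repair is just "recompute which issue is the max per key and flag it" (a pandas sketch of the logic on toy data; the actual fix would be a SQL UPDATE against covidcast):

```python
import pandas as pd

# Toy rows for one geo-temporal point, mirroring the ca/20200403 case:
# the flag sits on 20210927 even though issue 20211030 exists.
df = pd.DataFrame({
    "issue":           [20210916, 20210927, 20211030],
    "is_latest_issue": [0, 1, 0],
})

# The correct flag is 1 on the maximum issue and 0 elsewhere.
# (With multiple points, group by the full (source, signal, geo, time) key first.)
df["correct_flag"] = (df["issue"] == df["issue"].max()).astype(int)

# Rows needing repair are those where the stored flag disagrees.
needs_repair = df[df["is_latest_issue"] != df["correct_flag"]]
print(needs_repair["issue"].tolist())  # [20210927, 20211030]
```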
I believe the next steps here -- as laid out by Dmitry above -- are to check the archive differ S3 bucket to ensure that our cache is in sync.
Took a quick look at the JIT comparisons of derived signals, and the patched JHU data was free of cache-sync issues.
🥳