Comments (12)

krivard commented on August 18, 2024

Nice work identifying that example! Looping in @neul3 as he's growing more familiar with debugging these kinds of issues using the DB.

The database records show at least one additional problem, and that is a misfiled is_latest_issue flag:

MariaDB [epidata]> select value, is_latest_issue, issue, from_unixtime(value_updated_timestamp) from covidcast where source="jhu-csse" and `signal`="confirmed_incidence_num" and geo_type="state" and geo_value="ca" and time_type="day" and time_value=20200403 and issue>20210000;
+-------+-----------------+----------+----------------------------------------+
| value | is_latest_issue | issue    | from_unixtime(value_updated_timestamp) |
+-------+-----------------+----------+----------------------------------------+
|  1227 | 0               | 20210126 | 2021-10-18 11:29:04                    |
|  1227 | 0               | 20210401 | 2021-10-22 17:44:36                    |
|  1235 | 0               | 20210414 | 2021-10-22 17:44:36                    |
|  1293 | 0               | 20210916 | 2021-09-16 14:26:24                    |
|  1293 | 0               | 20210917 | 2021-10-22 17:44:36                    |
|  1295 | 1               | 20210927 | 2021-10-22 17:44:36                    |
|  1324 | 0               | 20211030 | 2021-10-30 06:04:45                    |
+-------+-----------------+----------+----------------------------------------+
7 rows in set (0.00 sec)

What seems to have happened is:

  1. The value was updated on Sept 16 using the daily acquisition pipeline.
  2. The value was updated on Oct 22 as part of a data patch including issues from April and September. This patch may have added new issues or updated old ones; there's no way to tell. It did, however, claim is_latest_issue for this geo-temporal point at issue 20210927.
  3. The value was updated on Oct 30 using the daily acquisition pipeline. It should have claimed is_latest_issue but, for whatever reason, did not.

The immediate solution in v3 would be to find a quiet afternoon and run an exhaustive is_latest_issue repair, because if this happened once, it has probably happened multiple times. With the v4 rollout, there was half a chance the data migration would have fixed this particular fault automatically when splitting data between the latest and history tables, but alas, it looks like the migration routine treated the is_latest_issue flags as authoritative as a time-saving measure. We'll need to adapt the old v3 is-latest repair routines for v4.
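
For reference, the repair logic itself is simple. Here is a minimal sketch in pandas (column names follow the queries above; this is not the actual v3 repair routine):

```python
import pandas as pd

def repair_is_latest_issue(rows: pd.DataFrame) -> pd.DataFrame:
    """Recompute is_latest_issue from scratch: within each
    (source, signal, geo_type, geo_value, time_type, time_value) group,
    only the row with the maximum issue should carry the flag."""
    key = ["source", "signal", "geo_type", "geo_value", "time_type", "time_value"]
    latest = rows.groupby(key)["issue"].transform("max")
    rows = rows.copy()
    rows["is_latest_issue"] = (rows["issue"] == latest).astype(int)
    return rows

# The faulty CA rows above: issue 20211030 should win the flag.
df = pd.DataFrame({
    "source": "jhu-csse", "signal": "confirmed_incidence_num",
    "geo_type": "state", "geo_value": "ca", "time_type": "day",
    "time_value": 20200403,
    "issue": [20210927, 20211030],
    "is_latest_issue": [1, 0],   # the misfiled state observed in the DB
})
fixed = repair_is_latest_issue(df)
print(fixed[["issue", "is_latest_issue"]])
```

In SQL the same idea would be a join against a per-key MAX(issue) subquery, but the pandas version is easier to eyeball.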

The upside is that the cache didn't lose sync, at least not in this instance. Is it possible to adapt your evaluation script to double-check that what's being reported as the latest issue doesn't have a sneaky more-recent sibling? Something like

import covidcast
from datetime import date

covidcast.signal("jhu-csse", "confirmed_incidence_num", start_day=date(2020, 4, 3), end_day=date(2020, 4, 3), geo_type="state", geo_values="ca", issues=[date(2021, 9, 27), date.today()])

might do it.
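
A sketch of what that double-check could look like, assuming the evaluation script already holds the default (latest) response and the full issue-range response as dataframes (`has_sneaky_sibling` is a hypothetical helper, not existing code):

```python
import pandas as pd

def has_sneaky_sibling(latest: pd.DataFrame, all_issues: pd.DataFrame) -> pd.DataFrame:
    """Flag rows where the issue reported as latest is older than some
    other issue the API knows about for the same geo/time point."""
    key = ["geo_value", "time_value"]
    true_latest = (all_issues.groupby(key)["issue"].max()
                   .rename("true_latest_issue").reset_index())
    merged = latest.merge(true_latest, on=key, how="left")
    return merged[merged["issue"] < merged["true_latest_issue"]]

# The CA example: the API reports issue 20210927 as latest, but 20211030 exists.
latest = pd.DataFrame({"geo_value": ["ca"], "time_value": [20200403],
                       "issue": [20210927], "value": [1295.0]})
all_issues = pd.DataFrame({"geo_value": ["ca"] * 2, "time_value": [20200403] * 2,
                           "issue": [20210927, 20211030], "value": [1295.0, 1324.0]})
bad = has_sneaky_sibling(latest, all_issues)
print(bad)
```

Any rows returned are geo-temporal points whose reported latest issue has a more recent sibling.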

from covidcast-indicators.

krivard commented on August 18, 2024

Patch is in! Spot check:

mysql> select value, issue, from_unixtime(value_updated_timestamp) from epimetric_full_v where source="jhu-csse" and `signal`="confirmed_incidence_num" and geo_type="state" and geo_value="ca" and time_type="day" and time_value=20200403 and issue>20210000;
+-------+----------+----------------------------------------+
| value | issue    | from_unixtime(value_updated_timestamp) |
+-------+----------+----------------------------------------+
|  1227 | 20210126 | 2021-10-18 15:29:04                    |
|  1227 | 20210401 | 2021-10-22 21:44:36                    |
|  1235 | 20210414 | 2021-10-22 21:44:36                    |
|  1293 | 20210916 | 2021-09-16 18:26:24                    |
|  1293 | 20210917 | 2021-10-22 21:44:36                    |
|  1295 | 20210927 | 2021-10-22 21:44:36                    |
|  1324 | 20211030 | 2021-10-30 10:04:45                    |
+-------+----------+----------------------------------------+
7 rows in set (0.27 sec)

mysql> select value, issue, from_unixtime(value_updated_timestamp) from epimetric_latest_v where source="jhu-csse" and `signal`="confirmed_incidence_num" and geo_type="state" and geo_value="ca" and time_type="day" and time_value=20200403 and issue>20210000;
+-------+----------+----------------------------------------+
| value | issue    | from_unixtime(value_updated_timestamp) |
+-------+----------+----------------------------------------+
|  1324 | 20211030 | 2021-10-30 10:04:45                    |
+-------+----------+----------------------------------------+
1 row in set (0.00 sec)

@dshemetov can you see if there are any remaining bad values in jhu-csse?

dshemetov commented on August 18, 2024

I have a batch of other possible errors in the incidence signal. These come from a comparison between the live code and JIT: the live-code value column is on the left ("self"), and the JIT value is on the right ("other"). You can see that the -3.0 difference on 20200630 is another live-code error of the same kind as in the queries above.

                       value        
                        self   other
geo_value time_value                
02100     20200624       0.0     2.0
          20200630       0.0    -3.0
          20200715       0.0     5.0
          20200827       0.0    -3.0
          20200828       0.0     3.0
          20201007       0.0    -4.0
          20201009       0.0     4.0
          20201012       0.0    -4.0
          20201017       0.0     5.0
06003     20201110       2.0     0.0
          20210105       3.0     0.0
06017     20200511       2.0     1.0
          20200515       5.0     2.0
          20200522       0.0     7.0
          20200523       7.0     0.0
          20200524       0.0     2.0
          20200525       4.0     2.0
          20200526       0.0     1.0
          20200528       4.0     0.0
          20200529       0.0     3.0
          20200530       6.0     1.0
          20200531       0.0     2.0
          20200601       6.0     0.0
          20200603       0.0     2.0
          20200604       2.0    11.0
          20200605       4.0     0.0
          20200606       4.0     3.0
          20200609       2.0     1.0
          20200610       0.0     1.0
          20200612       2.0     1.0
          20200617       3.0     1.0
          20200618       4.0     5.0
          20200619       5.0     3.0
          20200620       0.0     6.0
          20200621       0.0     2.0
          20200622      15.0     5.0
          20200701      20.0     8.0
          20200702       8.0    10.0
          20200703       0.0     8.0
          20200704      20.0    10.0
          20200705       0.0    16.0
          20200706       0.0     5.0
          20200708      16.0     4.0
          20200709      12.0     7.0
          20200711       0.0    17.0
          20200712       0.0     7.0
          20200713       0.0     8.0
          20200714      43.0    26.0
          20200827       6.0     4.0
          20201007       9.0     8.0
          20201008      13.0    11.0
          20201012       2.0    12.0
          20201013      17.0     8.0
          20201014       3.0     2.0
          20201016       7.0     8.0
          20201110      35.0    36.0
          20201111      21.0    13.0
06027     20200508       0.0    -1.0
          20200509       0.0     1.0
          20200511       0.0    -1.0
          20200515       0.0     1.0
          20200516       0.0    -1.0
          20200527       0.0     1.0
          20200528       0.0    -1.0
          20200624       2.0     1.0
          20200701       1.0     0.0
          20200702       0.0    -1.0
          20200703       1.0     3.0
          20200708       0.0     3.0
          20200712       1.0     0.0
          20200714       2.0     0.0
          20200827       4.0     0.0
          20201007       2.0     1.0
          20201013       3.0     6.0
          20201014       4.0     2.0
          20201016       1.0     2.0
          20201110       4.0     0.0
          20210108       0.0    26.0
ak        20200506       1.0     2.0
          20200508       3.0    -3.0
          20200509       2.0     8.0
          20200511       0.0    -7.0
          20200515       2.0    15.0
          20200516       0.0   -10.0
          20200527       1.0    21.0
          20200528      13.0    -7.0
          20200624      25.0   180.0
          20200630      36.0  -159.0
          20200715      55.0   394.0
          20200731     111.0   109.0
          20200817      50.0    51.0
          20201014     142.0  -883.0
          20201015     172.0  1221.0
          20201110     523.0   514.0
          20201112     485.0   492.0
          20210105     193.0   174.0
          20210106     338.0   360.0
          20210108     399.0   372.0
          20210109     320.0   341.0
ca        20200220       0.0     2.0
          20200403    1295.0  1324.0
          20200404     795.0   746.0
          20200406    1549.0  1510.0
          20200407    1125.0  1097.0
          20200409     938.0   960.0
          20200410    1249.0  1216.0
          20200412     768.0   743.0
          20200415    1813.0  1812.0
          20200416    1229.0  1222.0
          20200421    2139.0  2191.0
          20200428    1440.0  1515.0
          20200501    1819.0  1834.0
          20200503    1498.0  1367.0
          20200518    1291.0  1446.0
          20200523    2287.0  2067.0
          20200531    2056.0  2333.0
          20200607    2022.0  2535.0
          20200614    2282.0  2550.0
          20200830    4118.0  4136.0
          20210703     653.0   679.0

dshemetov commented on August 18, 2024

I computed the incidence signal a third way on this same dataset, using pandas groupby and diff, and found that the JIT code had no differences, while the live code produced the same list as above.
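
For reference, that third computation is just a few lines. A sketch on toy cumulative data (not the actual JIT code):

```python
import pandas as pd

# Cumulative counts per geo, sorted by time; incidence is the day-over-day diff.
cum = pd.DataFrame({
    "geo_value": ["ca", "ca", "ca", "ak", "ak"],
    "time_value": [20200402, 20200403, 20200404, 20200402, 20200403],
    "value": [10000.0, 11324.0, 12070.0, 100.0, 102.0],
})
cum = cum.sort_values(["geo_value", "time_value"])
cum["incidence"] = cum.groupby("geo_value")["value"].diff()
print(cum)
```

Each geo's first day comes out NaN, since there is no prior cumulative value to diff against.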

dshemetov commented on August 18, 2024

Ok, now this is very strange:

  • I ran the indicator and generated output CSVs
  • I pulled the data from the CSV files into a dataframe
  • I manually diffed the cumulative signal and compared it to the incidence signal and they match up!

So the API has different data from what the indicator output CSV files contain.

So something might be wrong with the archiver??? 👀 Either:

  1. the archiver is not producing correct diffs
  2. the archiver produced correct diffs, but they never got uploaded to the db, so it's out of sync with db
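
To separate hypothesis 1 from hypothesis 2, the diffing step can be exercised in isolation. Here is a simplified stand-in for the differ keyed on geo_id (`diff_csvs` is hypothetical, not the actual archiver code):

```python
import io
import pandas as pd

def diff_csvs(before_csv: str, after_csv: str) -> pd.DataFrame:
    """Return rows of `after` that are new or changed relative to `before`,
    keyed on geo_id -- a simplified stand-in for the archive differ."""
    # Read geo_id as str so FIPS codes keep their leading zeros.
    before = pd.read_csv(io.StringIO(before_csv), dtype={"geo_id": str})
    after = pd.read_csv(io.StringIO(after_csv), dtype={"geo_id": str})
    merged = after.merge(before, on="geo_id", how="left",
                         suffixes=("", "_before"), indicator=True)
    changed = (merged["_merge"] == "left_only") | (merged["val"] != merged["val_before"])
    return merged.loc[changed, after.columns]

# The county 02100 check: the value changes from 0.0 to 2.0, so the
# diff should contain exactly that row.
before = "geo_id,val\n02100,0.0\n06003,1.0\n"
after = "geo_id,val\n02100,2.0\n06003,1.0\n"
print(diff_csvs(before, after))
```

If a stand-in like this finds the change but the DB never received it, hypothesis 2 (the upload step) becomes the suspect.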

dshemetov commented on August 18, 2024

I used the county 02100 data above and tested the archiver on the incidence data (created a before file with 02100 20200624 0.0 and an after file with 02100 20200624 2.0 to make sure the diff is detected). The diff is correctly found, see #1687.

dshemetov commented on August 18, 2024

Alright, so at least one instance of cache desync confirmed:

# API value: 1295.0
import covidcast
from datetime import date
covidcast.signal("jhu-csse", "confirmed_incidence_num", start_day=date(2020, 4, 3), end_day=date(2020, 4, 3), geo_type="state", geo_values="ca")
# 0	ca	confirmed_incidence_num	2020-04-03	2021-09-27	542	0	0	0	1295.0	None	None	state	jhu-csse

# Cache value: 1324.0
import boto3
import pandas as pd
from io import StringIO
s3 = boto3.Session(profile_name='delphi').resource('s3')
obj = s3.Object('delphi-covidcast-indicator-output', 'jhu/20200403_state_confirmed_incidence_num.csv')
csv_contents = obj.get()['Body'].read().decode('utf-8')
pd.read_csv(StringIO(csv_contents)).query("geo_id == 'ca'")
# 5	ca	1324.0	NaN	NaN

# Local indicator output value: 1324.0
pd.read_csv("receiving/20200403_state_confirmed_incidence_num.csv").query("geo_id == 'ca'")
# 5	ca	1324.0	NaN	NaN

So the cache thinks the DB already has the latest value as its latest issue, when in fact it doesn't, so the new value is never uploaded.

I guess I'll download the full S3 cache, download the full API dataset, and see how many values are in this situation.
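
Scoping that comparison could boil down to an outer merge on the key columns. A sketch on toy frames (the real run would iterate over files pulled from S3 and the API):

```python
import pandas as pd

def find_desyncs(cache: pd.DataFrame, api: pd.DataFrame) -> pd.DataFrame:
    """Outer-merge the cached indicator output against the API's latest
    issue and keep the rows where the values disagree."""
    merged = cache.merge(api, on=["geo_value", "time_value"],
                         how="outer", suffixes=("_cache", "_api"))
    return merged[merged["value_cache"] != merged["value_api"]]

# The confirmed example: cache holds 1324.0 for ca/20200403, API serves 1295.0.
cache = pd.DataFrame({"geo_value": ["ca", "ak"], "time_value": [20200403] * 2,
                      "value": [1324.0, 50.0]})
api = pd.DataFrame({"geo_value": ["ca", "ak"], "time_value": [20200403] * 2,
                    "value": [1295.0, 50.0]})
print(find_desyncs(cache, api))
```

Note that `!=` also flags rows missing from one side (NaN never compares equal), which is useful here since a missing row is itself a desync.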

dshemetov commented on August 18, 2024

Good find on the is_latest_issue problem! And thanks for that code (didn't know you could pass a range to issue)!

dshemetov commented on August 18, 2024

A summary of work done today with @nolangormley:

  • we started to measure the scope of the rows with incorrect is_latest_issue flags
  • out of ~500k is_latest_issue=1 rows with signal=confirmed_incidence_num and geo_type=state, we found ~5k rows where a more recent issue was available
  • these rows all had time_value prior to 20211020, suggesting that this is due to leftover bugs from the patch that went out at that time (and not a persistent problem with the indicators or another part of the pipeline)

Next steps:

  • get a count of the total number of rows with incorrect is_latest_issue and get a breakdown of these by (signal, geo)
    • need an optimized DB query for this, since it will potentially search across all our indicators; here is a starting point (h/t @krivard): SELECT * FROM covidcast c JOIN (SELECT source, `signal`, geo_type, geo_value, time_type, time_value, issue latest_issue FROM covidcast t WHERE is_latest_issue=1) x USING (source, `signal`, geo_type, geo_value, time_type, time_value) WHERE issue>=latest_issue
  • build a table of all the rows that need to be repaired; we can use this table to create a patch DB update statement; the rows need the (source, signal, geo_type, geo_value, time_type, time_value, issue, is_latest_issue) fields
  • prepare a repair announcement to the mailing list
  • schedule a time to run this fix
    • estimate its impact on the DB (can do this after the table of rows to be repaired is constructed)
  • consider writing a crawler utility to periodically sanity check our whole database for computed fields like this in downtimes
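
Once the misflagged rows are extracted (e.g. with a query like the one above), the (signal, geo) breakdown is a one-liner. A sketch on a hypothetical extract:

```python
import pandas as pd

# Hypothetical extract: rows whose is_latest_issue=1 row is not the max issue.
bad_rows = pd.DataFrame({
    "signal": ["confirmed_incidence_num"] * 3 + ["deaths_incidence_num"],
    "geo_type": ["state", "state", "county", "state"],
    "geo_value": ["ca", "ak", "06003", "ca"],
})
breakdown = bad_rows.groupby(["signal", "geo_type"]).size().rename("n_bad_rows")
print(breakdown)
```

The same counts could of course be produced DB-side with a GROUP BY, but a pandas pass over the extracted repair table keeps the heavy query to a single run.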

Next steps for JIT:

  • partially blocked on this getting fixed
  • get A/B tests ready for when this is fixed
  • when this is fixed, if there are discrepancies, consider checking for an archiver cache desync (requires downloading a lot of data from S3; hopefully we don't have to do this)

nolangormley commented on August 18, 2024

I believe the next steps here -- as laid out by Dmitry above -- are to check the archive differ S3 bucket to ensure that our cache is in sync.

dshemetov commented on August 18, 2024

Took a quick look at the JIT comparisons of derived signals, and the patched JHU data was free of cache-sync issues.

krivard commented on August 18, 2024

🥳
