
Comments (26)

lhelleckes commented on May 30, 2024

Thanks for opening the PR @MaPePeR, and thanks for the great discussion here, everyone.
@michaelosthege and I will have a look at it as soon as possible. Regarding the data, we just made a request to the RKI and hope to get the latest numbers soon.

MaPePeR commented on May 30, 2024

Yes, you understood correctly. I only shorten the daily data from BE to simulate the data situation we have in Germany.
I only scale the part between the dotted lines, because I assume the weekly data is not completely available either. So I don't scale the full series.

At the moment I'm wondering whether it would be more robust not to base it on weeks and instead allow for arbitrary gaps in the data.
It needs two Series: the daily data and one with the "weekly" sums or the weekly total/running sums. These could be provided in separate dataframes or in a single dataframe. To keep with the style of the existing methods I'd go for a single dataframe and adapt the get_testcounts_DE function to also add the OWID series to the frame.

Then it depends on whether this method should only do the scaling or also the predicting. If it should not do the predicting itself, but act as a "postprocessing" step after the prediction, it needs the predicted data and needs to know at which point the known data ends and the prediction starts.

At the moment this can only be used for the all region, because we don't have data for the other regions. (One could try to estimate those, but that's a totally different topic, so I would be happy to first improve the all region.)
So it should also be region-specific and only work on a single region, by either supplying the region name or only the Series for that specific region.
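
For illustration, a minimal sketch of what such a single-dataframe layout could look like (the column names, the sparse weekly column, and the "all" region key are assumptions, not the project's actual schema):

import pandas as pd

# Hypothetical layout: indexed by (region, date) like the existing loaders,
# with daily counts plus a sparse column of weekly totals that is only
# filled on the last day of each reporting week.
idx = pd.MultiIndex.from_product(
    [["all"], pd.date_range("2020-11-01", periods=14, freq="D")],
    names=["region", "date"],
)
df = pd.DataFrame(index=idx, columns=["new_tests", "new_tests_weekly"], dtype=float)
df["new_tests"] = range(100, 114)                                  # daily counts
df.loc[("all", pd.Timestamp("2020-11-07")), "new_tests_weekly"] = 750.0
df.loc[("all", pd.Timestamp("2020-11-14")), "new_tests_weekly"] = 820.0

# Working on a single region keeps the scaling step simple:
daily = df.xs("all")["new_tests"]
weekly_sums = df.xs("all")["new_tests_weekly"].dropna()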

michaelosthege commented on May 30, 2024

@MaPePeR we have since received new data from the RKI. The ArcGIS database is currently unavailable (they seem to have technical difficulties), which is why https://Rtlive.de has not yet been updated with the new data.

We will check out your PR and compare the adjustments made by your function with the new data to see how much it could have improved the prediction.

w-flo commented on May 30, 2024

Just had an idea: the RKI lab surveillance reports (see the link in my issue report) indicate a test positivity rate for each Bundesland on page 5. It's a bit hard to read, but it should be possible, maybe with some help from GIMP. The test positivity rate combined with the number of cases reported by the Bundesland in that week could be used to estimate the number of tests.

Edit: Note that the table on page 4 is not useful because it aggregates numbers over 10 weeks. The current positivity rate needs to be extracted from the graphs on page 5.
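
As a back-of-the-envelope illustration with made-up values, the estimate would simply be:

# number_of_tests ≈ reported_cases / positivity_rate
weekly_cases = 12_000        # cases reported by the Bundesland in that week (made-up value)
positivity_rate = 0.08       # read off the page-5 figure (made-up value)
estimated_tests = weekly_cases / positivity_rate
print(f"~{estimated_tests:,.0f} tests")   # ~150,000 tests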

michaelosthege commented on May 30, 2024

Hi @w-flo,
It is clearly a data availability problem. As far as I know the weekly RKI reports are based on that same underlying data.
Yesterday we got an update.

From a modeling perspective it should be possible to build a more complete PyMC3 model (maybe using the timeseers package) that models all of Germany hierarchically, while also incorporating the various sums from the RKI report. It could be a fun challenge to build such a model and would certainly be a valuable contribution.
However, all of this data already exists, so an alternative approach would be to scrape such numbers from the local health authorities. At least for Berlin, I saw such a plot once.

sdschulze commented on May 30, 2024

Is the nationwide data easier to get, so that it could be integrated into the nationwide plot?

michaelosthege commented on May 30, 2024

@sdschulze I am not aware of any data source other than the weekly RKI reports. And those are just weekly numbers, based on essentially the same underlying data. As I wrote earlier, one could incorporate this into a more sophisticated model.

MaPePeR commented on May 30, 2024

It seems like the weekly test counts are published much more frequently than the daily test counts (which is kind of sad/hilarious?). So maybe one can:

  • Take the known daily counts.
  • Make the prediction of daily counts as before.
  • Take the weekly test count (potential source: https://data.humdata.org/dataset/total-covid-19-tests-performed-by-country) and scale the predicted daily counts so that each week's total matches (see the sketch after this list).
  • Re-predict the test count for days for which we don't have a weekly test count, using the sum-corrected data.
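
A minimal sketch of the scaling step from the third bullet, assuming the prediction is a daily pandas Series and the weekly totals are indexed by the last day of each reporting week (names and layout are illustrative, not the repo's API):

import pandas as pd

def scale_daily_to_weekly(predicted_daily: pd.Series, weekly_totals: pd.Series) -> pd.Series:
    """Rescale predicted daily test counts so each week sums to the reported weekly total."""
    result = predicted_daily.copy()
    for week_end, total in weekly_totals.items():
        week = result.loc[week_end - pd.Timedelta(days=6):week_end]   # the 7 days ending at week_end
        result.loc[week.index] = week * (total / week.sum())          # scale the week to the reported sum
    return result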

michaelosthege commented on May 30, 2024

@MaPePeR Yes. If you want to start, take a look at the ourworldindata.py data source. It also covers Germany.

Unfortunately I cannot share the CSV of daily German testcounts, but for comparison you can just use the data loader from Belgium (https://github.com/rtcovidlive/rtlive-global/blob/master/notebooks/Tutorial_dataloading.ipynb).

By the way, there are many other countries in the OWID dataset that we could support. If anybody contributes a data loader for the UK, Ireland or Spain, that would be very welcome!
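
For a quick look at what the OWID dataset provides for Germany (or any of those countries), something along these lines should work; the URL and column names are assumptions based on the public OWID CSV, not the repo's loader:

import pandas as pd

OWID_CSV = "https://covid.ourworldindata.org/data/owid-covid-data.csv"
owid = pd.read_csv(OWID_CSV, parse_dates=["date"])
de = owid[owid["iso_code"] == "DEU"].set_index("date")
# daily test counts are sparse/missing for Germany, which is the whole problem here
print(de[["new_tests", "new_cases"]].dropna(how="all").tail(14))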

MaPePeR commented on May 30, 2024

I can give it a try. My experience with pandas is minimal, though.
And this correction does not really work for the Bundesländer, because we don't have that data.

I think I'm going to ignore them for now and only try to correct the country data.

Other possibilities would be:

  • Scaling the Bundesländer with the same factors as the summed data
  • Using the last 2 weeks with daily data to estimate the percentage of tests in each Bundesland and multiplying it by the known country sum.
  • Doing two prophet predictions: one for the average daily tests per week in Germany, and one for the daily tests in each Bundesland divided by that weekly average. Multiplying the two predictions then gives values for each Bundesland.

The biggest problem I see is making this correction somewhat transparent to readers. What do you think?

EDIT: At the moment I'm leaning towards implementing the latest idea of predicting the weekly tests and the distribution of tests across weekdays and regions separately (see the sketch below). But someone else should take a look at it and consider whether that's really a good idea.
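
To make that multiplicative idea concrete, here is a sketch of how the weekday and region shares could be estimated (illustrative only; the frame layout and names are assumptions):

# expected_tests[region, day] ≈ weekly_level[week(day)] * weekday_share[day.weekday()] * region_share[region]
import pandas as pd
from typing import Tuple

def estimate_shares(daily: pd.DataFrame) -> Tuple[pd.Series, pd.Series]:
    """daily: test counts indexed by date, one column per Bundesland (assumed layout)."""
    totals = daily.sum(axis=1)                                   # country-wide daily totals
    weekday_share = totals.groupby(totals.index.weekday).mean()  # mean tests per weekday...
    weekday_share /= weekday_share.sum()                         # ...normalised to shares
    region_share = daily.sum() / daily.values.sum()              # fraction of all tests per region
    return weekday_share, region_share

# A prophet forecast of the weekly level (from the published weekly sums) times
# weekday_share times region_share would then give a daily, per-region estimate.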

MaPePeR commented on May 30, 2024

I attempted that third idea, but the prediction of the weekly average tests didn't work out as planned. The prediction of the "region factor" also doesn't look promising, and prophet raised warnings I don't understand.

I kind of expected the prediction to look like the predicted tests for the Netherlands. It does not. (See [14] in my notebook.)
[Plot: test prediction for the Netherlands from rtlive.de]

michaelosthege commented on May 30, 2024

@MaPePeR you'll probably want to set this kwarg to False: https://github.com/rtcovidlive/rtlive-global/blob/master/rtlive/preprocessing.py#L108

MaPePeR commented on May 30, 2024

I don't think this change improved the prediction:
With keep_data=True:
[plot]
With keep_data=False:
[plot]

michaelosthege commented on May 30, 2024

@MaPePeR in your cell [12] the good-looking yellow zig-zag line is df_be.diff(), but in [13] you're training the prophet model with df_be_weekly_avg_per_day.

MaPePeR commented on May 30, 2024

For that part I don't want to predict the yellow zig-zag line but the blue staircase, so I assume df_be_weekly_avg_per_day is correct. I expected the result to roughly follow the staircase, as seen in the Netherlands test prediction.

I cannot reproduce the forecast for the Netherlands:

# assuming the notebook's imports (module paths may differ in the actual repo):
import pandas
from matplotlib import pyplot
from rtlive.sources import data_nl, ourworldindata

result_df, results = data_nl.forecast_NL(ourworldindata.create_loader_function("NL")(pandas.Timestamp('today')))
pyplot.plot(result_df.xs('all')['predicted_new_tests'], label='predicted')
pyplot.plot(result_df.xs('all')['new_tests'], label='tests')
pyplot.legend()

[plot]

So maybe there is something wrong with my prophet installation, or it behaves differently when called from a notebook?
I followed the miniconda env instructions from the README.

michaelosthege commented on May 30, 2024

The plot in this comment looks like prophet did not converge.

Your last plot here looks correct. On Rtlive.de/global we use process_testcounts(..., future_days=10), which is why the orange line in the "Netherlands, 2021-01-08" plot goes further into the future than yours.

Maybe I'm not seeing the problem you're referring to.

MaPePeR commented on May 30, 2024

I now tried:

# assuming the same imports as above (module paths may differ in the actual repo):
import pandas
from matplotlib import pyplot
from rtlive import data
from rtlive.sources import ourworldindata

nl_data = ourworldindata.create_loader_function("NL")(pandas.Timestamp('today'))
result_df, results = data.process_testcounts('NL', nl_data, 10)
pyplot.plot(result_df.xs('all')['predicted_new_tests'], label='predicted')
pyplot.plot(result_df.xs('all')['new_tests'], label='tests')
pyplot.legend()

[plot]

I was not expecting a saw-like prediction. But I now realize that this might have been a wrong assumption on my part, and that the saw pattern simply isn't visible on rtlive.de/global because of the graph scaling.

But I also realized that my approach is wrong: I shouldn't run the prediction on data that is the same for all weekdays and then expect the prediction to also be the same for all weekdays. Instead, I should probably use a predictor that predicts the real weekly data instead of "fake" daily data.

MaPePeR commented on May 30, 2024

I've gone back to my first idea of scaling the prediction directly, and I think it looks promising: https://github.com/MaPePeR/rtlive-global/blob/correct_prediction_with_weekly_sum/notebooks/Correct_daily_count_with_weekly_sum.ipynb

[plot]

Again, this is on Belgian data, but simulating the German data gaps.
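
For reference, simulating the German data situation from a complete series boils down to something like this sketch (function and column names are placeholders, not the notebook's actual code):

import pandas as pd

def simulate_german_gaps(daily_full: pd.Series, cutoff: str):
    """Shorten a complete daily series (e.g. the Belgian one) while keeping the
    weekly sums of the full series, mimicking the German data situation."""
    daily_known = daily_full.dropna().loc[:cutoff]                 # pretend daily data ends here
    weekly_sums = daily_full.dropna().resample("W-SUN").sum()      # the "published" weekly totals
    return daily_known, weekly_sums

# The usual prophet forecast is then trained on daily_known, and each predicted
# week is rescaled to match the corresponding entry of weekly_sums.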

michaelosthege commented on May 30, 2024

@MaPePeR indeed, this looks very promising! Do I understand it correctly that you used a shortened daily time series from BE for training and afterwards scaled the prediction with the weekly sums of the full series?

Now to "get this into production" you'll have to make this a function with a very clearly defined signature (type hints + docstring). I suppose this function should take two dataframes - one with daily data and one with weekly sums?

MaPePeR commented on May 30, 2024

I chose the 3-Series method signature, made the necessary changes, and created yet another notebook in order to test it with Belgian data again.

Looks like it works, but this needs review and testing, as I wasn't able to test the real data_de source, obviously.

Changes to the data_de source
The method is called in [11] in the test notebook. This time I called the "real" get_data_BE and forecast_BE from data_be.

Not sure if and when I should make a PR. Should I include any of the notebooks? I feel like they are not clean enough to include.

MaPePeR commented on May 30, 2024

I have now also attempted to improve the per-region total testcounts in that same notebook. The results also look very promising to me.

Two attempts:

  • Red: just scaling the other regions with the same factors as the all region.
  • Black: I estimated the distribution of tests across the regions by calculating each region's percentage of total tests in the last week of known daily data. That percentage was then used to scale the change between weekly reports, which was then added as projected growth on top of the cumulative sum of tests for each region (see the sketch after this list). This only works if the cumulative sums of the daily data and the report data are identical; I don't know if that's the case for the German data.
    It also probably still needs some work if the daily-data reports and weekly-data reports don't align on the same day. (I think I fixed this.)
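
A rough sketch of the second ("black") approach with placeholder names (not the notebook's actual code; the frame layout is assumed):

import pandas as pd

def project_region_totals(daily_by_region: pd.DataFrame,
                          report_cumulative: pd.Series) -> pd.Series:
    """Distribute the country-wide growth from the weekly reports across regions.

    daily_by_region: known daily test counts, one column per region, indexed by date.
    report_cumulative: cumulative country-wide test counts from the weekly reports.
    Assumes the cumulative daily data and the report data line up.
    """
    last_week = daily_by_region.last("7D")                     # last week with known daily data
    region_share = last_week.sum() / last_week.values.sum()    # fraction of tests per region
    growth = report_cumulative.diff().iloc[-1]                 # latest country-wide increase
    return daily_by_region.sum() + region_share * growth       # projected cumulative totals per region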

The region with the worst approximation:
[plot]

One of the better predictions:
[plot]

See notebook for all others.

w-flo commented on May 30, 2024

Thanks for your work on this! I'm unable to review it, though, since I don't know anything about this project's implementation right now, so I hope the project owners will take a look. :-)

A similar approach could probably be used for each German state, but the weekly testcount report is not machine-readable (it's a PDF file), and they only publish the sum of tests for a sliding window of the past 10 weeks. By starting with their first report, the testcount data for every single week could be inferred, I guess (see the sketch below). The reports are at https://ars.rki.de/Content/COVID19/Main.aspx. I guess the aggregate numbers in these PDF files won't necessarily match the "secret" daily testcount data the RKI provides rtlive.de with every couple of weeks, so there would need to be a state-specific scaling factor as well.
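
To make the differencing idea concrete, here is a small sketch; it assumes the earliest reports' windows were still filling up, which is what "starting with their first report" would rely on:

def weekly_from_window_sums(window_sums):
    """Recover single-week test counts from 10-week sliding-window sums.

    Assumes report k covers only the first k weeks until the window reaches
    its full 10-week length (an assumption, not something the reports state).
    """
    weeks = []
    for n, current in enumerate(window_sums):
        previous = window_sums[n - 1] if n > 0 else 0.0
        dropped = weeks[n - 10] if n >= 10 else 0.0   # week that left the window
        weeks.append(current - previous + dropped)
    return weeks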

Right now the testcount prediction for SH is way off (really small, because the number of tests was decreasing in the weeks before Christmas), so rtlive.de estimates severe undertesting and a high R_t value, while the testcount estimate for SN might be too high, leading to a smaller R_t and probably a too-small estimated infection rate. I hope the RKI sends rtlive.de new daily testcount data soon to fix this.

sdschulze commented on May 30, 2024

There seems to be a fluke right now, with the prediction about twice as high as the data.

michaelosthege commented on May 30, 2024

There seems to be a fluke right now, with the prediction about twice as high as the data.

It's actually not a bug; we just did not get around to updating the plot with an informative legend.
The change is related to #12, where we implemented correcting the testcounts with the weekly sums.

tl;dr: The dataset of daily testcounts we get from the RKI does not cover all tests, because not all laboratories report their testing volume.
We get this data about every month.
In the weekly reports the RKI publishes a sum of all tests done per week for all of Germany.
The 2x scaling is because we use this sum to scale up the testing profile to match the weekly sums reported publicly.

sdschulze commented on May 30, 2024

Thanks for the explanation! I'm not sure if it's related, but the Rt values from the RKI have now disappeared from the plot.

michaelosthege commented on May 30, 2024

@sdschulze no, that's a new issue. See #17
Any help would be greatly appreciated!
