
COVID-19 case numbers in Germany by state, over time 😷

(COVID-19 Fallzahlen für Deutschland)

Landing page: https://covid19-germany.appspot.com

This dataset is provided through comma-separated value (CSV) files. In addition, this project offers an HTTP (JSON) API.

How is this dataset different from others?

  • It includes historical data for individual Bundesländer and Landkreise (states and counties).
  • Its time series data is re-written as the underlying data gets better over time. This is based on official RKI-provided time series data, which receives daily updates even for days that lie weeks in the past (accounting for delay in reporting).
  • The HTTP endpoint /now consults multiple sources (and has changed its sources over time) to be as fresh and credible as possible while maintaining a stable interface.

Quick overview

  • Data files:
    • cases-rki-*.csv and deaths-rki-*.csv: history, based on Robert Koch-Institut data, the most credible view into the past: accounts for Meldeverzug (reporting delay). The historical evolution of the data points in here is updated daily based on the (less accessible) RKI ArcGIS system.
    • ags.json: a map for translating the "amtlicher Gemeindeschlüssel" (AGS) to Landkreis/Bundesland details (see the sketch after this list).
    • cases-rl-*.csv: history, based on Risklayer crowdsource effort.
    • data.csv: history, mixed data source based on RKI/ZEIT ONLINE, drives API.
  • JSON endpoint for the current state: /now
  • JSON endpoint for time series, example for Bayern: /timeseries/DE-BY/cases, based on data.csv
  • Endpoints for other states linked from this landing page: https://covid19-germany.appspot.com
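
A minimal sketch of an ags.json lookup in Python. Note that the precise structure of the per-AGS detail objects is an assumption here, so inspect an entry before relying on specific field names:

import json

# Load the AGS -> Landkreis/Bundesland detail map from this repository.
with open("ags.json", encoding="utf-8") as f:
    ags_map = json.load(f)

# Print one entry to see which detail fields are available (the value
# structure is an assumption; inspect it before use).
some_ags, details = next(iter(ags_map.items()))
print(some_ags, "->", details)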

Contact, questions, contributions

You probably have many questions, just as I did (and still do). Your feedback and questions are highly appreciated! Please use the GitHub issue tracker (preferred) or contact me via mail at [email protected].

What you should know before reading these numbers

Please question the conclusiveness of these numbers. Some directions along which you may want to think:

  • Although Germany seems to perform a large number of tests, we (the public) do not have good insight into how the testing rate (and its spatial distribution) evolves over time. In my opinion, one absolutely should know a whole lot about the testing effort itself before drawing conclusions from the time evolution of case count numbers.
  • Each confirmed case is implicitly associated with a reporting date. We do not know for sure how that reporting date relates to the date of taking the sample (there might be days between those two points in time).
  • We believe that each "confirmed case" actually corresponds to a polymerase chain reaction (PCR) test for the SARS-CoV-2 virus with a positive outcome. This is quite probably true, but we cannot verify it end-to-end; we have to trust the Landkreise, doctors, and labs.
  • Yet, we seem to believe that the change of the number of confirmed COVID-19 cases over time is somewhat expressive: but what does it shed light on, exactly? The amount of testing performed, and its spatial coverage? The efficiency with which the virus spreads through the population ("basic reproduction number")? The actual, absolute number of people infected? The virus' potential to exhibit COVID-19 in an infected human body?

If you keep these (and more) ambiguities and questions in mind, then I think you are ready to look at these numbers 😷.

Changelog: data source

In Germany, every step along the chain of reporting (Meldekette) introduces a noticeable delay. This is not necessary, but it is sadly the current state of affairs. The Robert Koch-Institut (RKI) seems to be working on a more modern reporting system that might mitigate some of these delays along the Meldekette in the future. Until then, it is fair to assume that case numbers published by the RKI lag 1-2 days behind the case numbers published by the Landkreise, which themselves have an unknown lag relative to the physical tests. In some cases, the Meldekette might even be disrupted entirely, as discussed in this SPIEGEL article (German). Also see this discussion.

Wishlist: every case should be tracked with its own timeline, and transparently change state over time. The individual cases (and their timelines) should be aggregated on a country-wide level, anonymously, and be published in near real time, through an official, structured data source, free for everyone to consume.

As discussed, the actual data flow situation is far from this ideal. Nevertheless, the primary concern of this dataset is to maximize data credibility while also trying to maximize data freshness; a challenging trade-off in this initial phase of pandemic growth in Germany. That is, the goal is to provide you with the least shitty numbers from a set of generally pretty shitty numbers. To that end, I took the liberty to iterate on the data source behind this dataset, as indicated below.

/now (current state):

  • Since (incl) March 26: Meldekette step 2: reports by the individual counties (Landkreise), curated by Tagesspiegel and Risklayer for the current case count, curated by ZEIT ONLINE for deaths.
  • Since (incl) March 24: Meldekette step 2: reports by the individual counties (Landkreise), curated by ZEIT ONLINE.
  • Since (incl) March 19: Meldekette step 3: reports by the individual states (Bundesländer), curated by ZEIT ONLINE, and Berliner Morgenpost.

/timeseries/... (historical data):

Update (evening March 29): I am considering re-writing the history exposed by these endpoints (data.csv) in the near future, using RKI data and accounting for long reporting delays.

  • Since (incl) March 24: Meldekette step 2: reports by the individual counties (Landkreise), curated by ZEIT ONLINE.
  • Since (incl) March 18: Meldekette step 3: reports by the individual states (Bundesländer), curated by ZEIT ONLINE.
  • Before March 18: Meldekette step 4: RKI "situation reports" (PDF documents).

Note:

  • The source identifier in the CSV file changes correspondingly over time.
  • A mix of sources in a time series is of course far from ideal. However, given the boundary conditions, I think switching to better sources as they come up is fair and useful. We might also change (read: rewrite) time series data in hindsight, towards enhancing overall credibility. That has not happened yet, but it may, as we learn more about the Germany-internal data flow and about the credibility of individual data sources.

Quality data sources published by Bundesländer

I tried to discover these step by step; they are possibly underrated:

Plots

Confirmed COVID-19 cases over time with exponential fit, for individual Bundesländer:

Automatically generated from this dataset, but possibly not every day.

Further resources:

CSV file details

  • The column names use the ISO 3166 code for individual states.
  • The points in time are encoded using localized ISO 8601 time string notation.
  • I did not incorporate the numbers on recovered cases so far, because individual Gesundheitsämter do not yet have the capacity to carefully track this metric (it is rather meaningless so far).
  • Right now my idea is to update this file daily during the (German) evening hours, after ZEIT ONLINE and Berliner Morgenpost have published their last update of the day.
  • As a differentiator from other datasets, the sample timestamps contain the time of day, so that consumers can at least get a vague impression of whether a sample represents the state in the morning or in the evening (a common point of confusion with RKI-derived datasets). A morning timestamp likely represents the state of the previous day; an evening timestamp is more likely to represent the state of that same day.

Example: parsing and plotting

import sys

import matplotlib.pyplot as plt
import pandas as pd


# Read the CSV file given as the first command line argument, using the
# tz-aware timestamps in the time_iso8601 column as the index.
df = pd.read_csv(
    sys.argv[1],
    index_col=["time_iso8601"],
    parse_dates=["time_iso8601"],
    date_parser=lambda col: pd.to_datetime(col, utc=True),
)
df.index.name = "time"

# Plot the confirmed case count for Baden-Württemberg (DE-BW) over time.
df["DE-BW_cases"].plot(
    title="DE-BW confirmed cases", marker="x", grid=True, figsize=[12, 9]
)
plt.savefig("bw_cases_over_time.png", dpi=200)
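
To run this sketch against, for instance, data.csv from this repository (the script name plot_cases.py is just an example):

$ python plot_cases.py data.csv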

HTTP API details

Some of the motivations behind the HTTP API are convenience (easy to consume with the tooling of your choice!), interface stability, and availability.

  • The HTTP API is served under https://covid19-germany.appspot.com
  • It is served by Google App Engine from a European data center
  • The code behind this can be found in the gae directory in this repository.

How to get historical data for a specific German state/Bundesland:

Construct the URL based on this pattern:

https://covid19-germany.appspot.com/timeseries/<state>/<metric>

For <state> use the ISO 3166 code, for <metric> use cases or deaths.

For example, to fetch the time evolution of the number of confirmed COVID-19 cases for Bayern (Bavaria):

$ curl -s https://covid19-germany.appspot.com/timeseries/DE-BY/cases | jq
{
  "data": [
    {
      "2020-03-10T12:00:00+01:00": "314"
    },
[...]

The points in time are encoded using localized ISO 8601 time string notation. Any decent datetime library can parse that into timezone-aware native timestamp representations.
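
For example, a minimal sketch in Python, based on the response shape shown above (using the third-party requests library):

from datetime import datetime

import requests

URL = "https://covid19-germany.appspot.com/timeseries/DE-BY/cases"

resp = requests.get(URL)
resp.raise_for_status()

# Each item in `data` is a one-element object mapping an ISO 8601
# time string to a case count (encoded as a string).
for sample in resp.json()["data"]:
    for time_iso8601, count in sample.items():
        # fromisoformat() returns a timezone-aware datetime (Python 3.7+).
        t = datetime.fromisoformat(time_iso8601)
        print(t, int(count))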

How to get the current snapshot for all of Germany (no time series):

$ curl -s https://covid19-germany.appspot.com/now | jq
{
  "current_totals": {
    "cases": 12223,
    "deaths": 31,
    "recovered": 99,
    "tested": "unknown"
  },
  "meta": {
    "contact": "Dr. Jan-Philip Gehrcke, [email protected]",
    "source": "ZEIT ONLINE (aggregated data from individual ministries of health in Germany)",
    "time_source_last_consulted_iso8601": "2020-03-19T03:47:01+00:00",
    "time_source_last_updated_iso8601": "2020-03-18T22:11:00+01:00"
  }
}
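
Correspondingly, a minimal Python sketch for consuming this snapshot (again using requests; field names as in the response shown above):

import requests

resp = requests.get("https://covid19-germany.appspot.com/now")
resp.raise_for_status()
now = resp.json()

# current_totals holds country-wide aggregates; "tested" may be "unknown".
totals = now["current_totals"]
print("cases:", totals["cases"], "deaths:", totals["deaths"])
print("last source update:", now["meta"]["time_source_last_updated_iso8601"])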

Notably, the Berliner Morgenpost seems to also do a great job at quickly aggregating the state-level data. This API endpoint chooses between that source and ZEIT ONLINE, whichever reports the higher case count.

Attribution

Shout-out to ZEIT ONLINE for continuously collecting and publishing the state-level data with little delay.

Edit: Notably, by now the Berliner Morgenpost seems to do an equally good job of quickly aggregating the state-level data. We are using that here, too. Thanks!

Edit March 26: Risklayer is coordinating a crowd-sourcing effort to process verified Landkreis data as quickly as possible. Tagesspiegel is verifying this effort and using it in their overview page. As far as I can tell this is so far the most transparent data flow, and also the fastest, getting us the freshest case count numbers. Great work!

Fast aggregation & communication is important during the phase of exponential growth.

Random notes

  • The MDC Berlin has published this visualization and this article, but they seemingly decided to not publish the time series data. I got my hopes up here at first!

