rocs-org / data-pipelines
Data engineering monorepo with Airflow, dbt and postgres
Home Page: https://ornt.biologie.hu-berlin.de/airflow
To make implementing new pipelines easy for everyone, we want
Since the links to the cases data files behind the ArcGIS dashboard break without warning from time to time, we are taking this as an opportunity to switch the source of the cases data to https://github.com/robert-koch-institut/SARS-CoV-2_Infektionen_in_Deutschland
This data is used in a couple of DAGs, and the new source does not contain, e.g., the Bundesland, Landkreis and Altersgruppe2 (irrelevant) columns, so this will take a bit of digging.
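A rough sketch of how the missing columns might be recovered from the new source; the file and column names are assumptions based on the RKI repository and should be verified:

import pandas as pd

# Assumed file and column names from the RKI repository.
cases = pd.read_csv("Aktuell_Deutschland_SarsCov2_Infektionen.csv")

# The Bundesland id is encoded in the leading digits of the five-digit
# district id (e.g. IdLandkreis 01001 -> Bundesland 01).
cases["IdBundesland"] = cases["IdLandkreis"] // 1000

# Human-readable Bundesland/Landkreis names still need a separate lookup
# table, since the new source only ships numeric ids.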
To make the infrastructure accessible for people who want to add their pipelines to it, it should feature some copy paste boilerplate examples and instructions on how to use them.
Currently, we rebuild our application image (containing airflow, poetry and all python deps) every time the dependencies change. This takes a lot of time and slows down the dev process.
Proposed solution: upload the current image (reflecting the state of the application on the main branch) to a Docker registry and pull the image for further development from there (thus only installing incremental changes in dependencies).
The pipeline producing the detections+json for https://corona-datenspende.de/science/monitor/ needs to be ported to an Airflow DAG. Along the way I also want to include a step to drop data from earlier than 2020-04-12, in case that is included in the ingest.
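The drop step could look roughly like this (Airflow 2 task API; the table name and connection string are placeholders):

from airflow.decorators import task
import psycopg2

@task
def drop_early_data():
    # Placeholder table/connection; remove rows from before the start
    # of data collection on 2020-04-12, in case the ingest includes them.
    with psycopg2.connect("...") as conn, conn.cursor() as cur:
        cur.execute(
            "DELETE FROM datenspende_derivatives.detections WHERE date < %s",
            ("2020-04-12",),
        )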
The docs say that this tool was made to enable people to convert their data wrangling in Jupyter notebooks into Airflow pipelines and jobs. As this is something that most of our users might want to do, we will evaluate this option and, if possible, create a dummy workflow as a hands-on example.
Thryve added a new column to the choice table that we don't expect and that breaks the DAG.
The symptoms question in the weekly survey has a new ID since ~2022/01/18. This causes the feature extraction pipeline to not gather answers for the new question, resulting in symptoms not being listed in the extracted features.
The new question ID is 137.
We have to extract symptom data from both ID 86 (the old one) and ID 137 (the new one) without breaking backward compatibility with extractions that only gathered answers from one question.
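A minimal sketch of the adjusted extraction; the table and column names are assumptions and should be checked against the actual datenspende schema:

import psycopg2

SYMPTOM_QUESTION_IDS = (86, 137)  # old and new question id

with psycopg2.connect("...") as conn, conn.cursor() as cur:
    # psycopg2 adapts the Python tuple to a SQL list, so answers to both
    # the old and the new question are gathered in one pass.
    cur.execute(
        "SELECT user_id, question, answer FROM datenspende.answers "
        "WHERE question IN %s",
        (SYMPTOM_QUESTION_IDS,),
    )
    rows = cur.fetchall()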
We currently do not receive/process the vital data at epoch granularity. This data holds tremendous value, and I want to re-add it to our database.
Steps
Pipeline exits with "connection reset by peer".
After restructuring the repo, the CD pipeline is broken. Fix it and make sure that next time it breaks, it also goes red.
It seems like the format of the files providing cases data has changed, resulting in failures in data collection.
For survey data from the "Tests and Symptoms" survey, we need a post-processed table that contains the following information:
To resolve the DAG interdependency between the survey and the vital data DAGs in the data donation project, we have to change their schedules. However, the schedules in the production deployment don't update even though the DAG parameters have already been changed.
In https://github.com/rocs-org/rocs-scripts, under fixme, there are two scripts; both are broken at the moment.
Due to how psycopg2 handles batched insert queries, the table is truncated again at every batch that is executed; see below:
https://www.psycopg.org/docs/extras.html#psycopg2.extras.execute_values
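A plausible minimal reproduction and fix, assuming the TRUNCATE is embedded in the insert query (table and columns are made up): execute_values re-executes the full SQL string for every page of page_size rows, so an embedded TRUNCATE wipes all previously inserted pages and only the last batch survives.

import psycopg2
from psycopg2.extras import execute_values

rows = [(i, f"name-{i}") for i in range(1000)]

with psycopg2.connect("...") as conn, conn.cursor() as cur:
    # Broken: "TRUNCATE my_table; INSERT INTO my_table VALUES %s" passed
    # to execute_values runs the TRUNCATE once per 100-row page.
    # Fix: truncate once, then batch-insert.
    cur.execute("TRUNCATE my_table")
    execute_values(
        cur,
        "INSERT INTO my_table (id, name) VALUES %s",
        rows,
        page_size=100,
    )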
Just spin up the docker stack and install dependencies. No tests, no linting etc. Those come later.
To have consistent database schemas in all environments, we want to have a single source of truth for them. This can be done via migrations that are tracked in git together with a migration manager that applies them to the databases that we use.
A good tooling option would be yoyo-migrations, as it allows migrations to be written in plain SQL as well as Python functions and works well with psycopg2.
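A minimal sketch of applying migrations with yoyo-migrations (the connection string is a placeholder; the directory is the one referenced elsewhere in this repo):

from yoyo import read_migrations, get_backend

backend = get_backend("postgresql://user:password@localhost/rocs")
migrations = read_migrations("dags/database/migrations/migration_files")

# Apply only the migrations that this database has not seen yet.
with backend.lock():
    backend.apply_migrations(backend.to_apply(migrations))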
Docker builds are failing due to a missing signature for the repository of one of the dependencies.
We usually need case numbers (i) raw, (ii) 7d-averaged/summed, or (iii) 7d-averaged/summed per 100k. I would like to have this provided as an extra table called incidences in our db.
I've written the following script that can be run whenever the case numbers are downloaded and processed. It uses polars v0.8.9 and computes incidences for every location on every level of abstraction that we have (4 levels):
These are the columns with types (for creating the table in the db):

import polars as po  # the script imports polars as `po`

columns = [
    ("location_id", po.UInt16),
    ("location_level", po.UInt8),
    ("date_of_report", po.Date),
    ("new_cases", po.Int32),
    ("new_cases_last_7d", po.Int32),
    ("incidence_7d_per_100k", po.Float32),
    ("new_deaths", po.Int32),
    ("nuts3", po.Utf8),  # was bare `str`; polars' string dtype
    ("population", po.UInt32),
    ("state", po.UInt8),
]
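A corresponding table definition, sketched as a plain-SQL yoyo migration step; the schema name and the polars-to-postgres type mapping (unsigned ints widened to the next larger signed type) are my assumptions:

from yoyo import step

steps = [
    step(
        """
        CREATE TABLE coronacases.incidences (
            location_id            INTEGER,   -- UInt16
            location_level         SMALLINT,  -- UInt8
            date_of_report         DATE,
            new_cases              INTEGER,
            new_cases_last_7d      INTEGER,
            incidence_7d_per_100k  REAL,
            new_deaths             INTEGER,
            nuts3                  TEXT,
            population             BIGINT,    -- UInt32
            state                  SMALLINT   -- UInt8
        )
        """,
        "DROP TABLE coronacases.incidences",
    )
]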
For the datenspende project, we need to collect user account data from Thryve as a prerequisite for dealing with vital and survey data.
We want to deploy to production each time we tag a release.
The data export from Thryve contains a list of all users that are currently using the app. We should delete all users on our side that are not part of that list.
Implement the detections classifier, implementing the interface of sklearn classifiers, in a separate library package.
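A skeleton of the sklearn estimator contract the package would implement (class name and hyperparameter are placeholders; the detection logic itself is elided):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class DetectionsClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, threshold=0.5):
        self.threshold = threshold  # only store hyperparameters here

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.classes_ = np.unique(y)
        # ... fit the actual detection model here ...
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        # ... real decision rule goes here; placeholder predicts class 0 ...
        return np.full(X.shape[0], self.classes_[0])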
This is a workaround (#98) to resolve the issue of DAGs accessing the same table. In the not too distant future we probably want to restructure the datenspende DAGs.
To avoid unnecessary CI failures, run code formatting automatically. Also, run linting and reject the commit if linting does not pass.
For our production deployment, we need to use the actual production database of the group and probably set some other environment variables that differ from our CI and local testing environments.
In order to quickly pull additional information, the table datenspende_derivates.homogenized_features should contain a column with the questionnaire_session of the corresponding data in datenspende.answers.
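A sketch of the schema change and backfill; the join keys are assumptions and need to be checked against the actual tables:

import psycopg2

with psycopg2.connect("...") as conn, conn.cursor() as cur:
    cur.execute(
        "ALTER TABLE datenspende_derivates.homogenized_features "
        "ADD COLUMN questionnaire_session BIGINT"
    )
    # Assumed join keys (user_id, questionnaire); adjust as needed.
    cur.execute(
        """
        UPDATE datenspende_derivates.homogenized_features AS f
        SET questionnaire_session = a.questionnaire_session
        FROM datenspende.answers AS a
        WHERE a.user_id = f.user_id
          AND a.questionnaire = f.questionnaire
        """
    )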
To run end-to-end tests for our pipelines, we need to mock our production database. To do this, we have to run all the migrations in CI every time to set up the DB schemas, so that tests find an environment that mirrors the one we have in production.
Import the predictions model and load baseline and user features data to make infection predictions. Save the results in the predictions table.
This should implement a simplified version of the baseline calculation here, without all the edge-case handling for missing or scarce data.
Set up the minimum required stack to run airflow with a target db for testing.
Calculation of incidences is currently failing due to an integer overflow (see logs here).
This can probably be fixed by either changing the incidence calculation script or changing the db schema to BIGINT for the respective values. @benmaier can probably advise.
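If the schema route is taken, the fix could be a one-line migration; the table and column names are assumptions based on the incidences table above:

from yoyo import step

steps = [
    step(
        "ALTER TABLE coronacases.incidences "
        "ALTER COLUMN new_cases_last_7d TYPE BIGINT",
        "ALTER TABLE coronacases.incidences "
        "ALTER COLUMN new_cases_last_7d TYPE INTEGER",
    )
]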
To validate the user input in the datenspende project, we need a list of valid German ZIP codes and their corresponding NUTS3 regions. Those are available at https://gisco-services.ec.europa.eu/tercet/flat-files
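A sketch of the ingest; the exact file name and column layout on the GISCO page are guesses and must be checked:

import pandas as pd

# Hypothetical file name; the tercet flat files are published per country
# and NUTS version. pandas reads the zipped CSV directly from the URL.
URL = "https://gisco-services.ec.europa.eu/tercet/flat-files/pc2020_DE_NUTS-2021_v1.0.zip"

# Assumed layout: semicolon-separated, quoted CODE (ZIP) and NUTS3 columns.
zip_to_nuts3 = pd.read_csv(URL, sep=";", quotechar="'")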
Add the responses from question 91 to datenspende_derivates.homogenized_features. They contain information about the type (PCR, Antigen, Antibody) of the test taken.
I'm starting to get a hold of how things work. Generally I prefer templates/examples to include inline comments wherever possible, so that every step is explained. Is that something that could be added?
From what I understand now, the following files are relevant for building my own pipeline:
1. dags/database/migrations/migration_files/20211014_99_move_test_table_to_test_shema.sql
(for defining the structure of the table)
2. dags/csv_download_to_postgres/csv_download_to_postgres.py (for doing the actual work)
3. dags/csv_download_to_postgres/test_csv_download_to_postgres.py
(for testing parts of the pipeline)
4. dags/csv_download_to_postgres/test_integration_csv_download_to_postgres.py
(for testing the entire pipeline)
Which others are necessary to start building a pipeline?
Originally posted by @marcwie in #65 (comment)
To incorporate test results, symptoms and information about users' bodies (height, weight, sex, etc.) we should add a table in datenspende_derivatives that contains that data.
I intend to add two separate tables for
Both should contain the following columns:
Possibly also vaccination status, but I haven't figured out yet where to get it, so this will probably move to a follow-up.
We want to prohibit pushing to master and introduce new code via feature branches. We require passing CI (and potentially preview CD) as well as one approving review for merging. We want to have a bot that does the merging and cleanup if possible.
Currently the management system on Ava has a table that contains config files. These are strings in JSON format with a name like "SuchGreenAnt", which is the current config for the datenspende detections master.
I would like to add this functionality under the new framework, because storing and retrieving configs in a central place provides great value imho.
I propose the following
So that we have access to our file sharing and can mount it into the pipeline management.
We need a pipeline to collect the survey data for the Datenspende project from Thryve.
As a first iteration, we just collect and store the survey data and figure out useful post processing steps later.
Currently, during docker builds, poetry installs/updates all dependencies, even though they are already present and versions don't change. This makes changing dependencies and rebuilding the docker stack slow and should be fixed.
To enable debugging of data, we need to add the questionnaire session to the homogenized features table to be able to track back the origin of feature values to the raw questionnaire data.