data-pipelines's People

Contributors: davhin, jakobkolb, mergify[bot], tashintalbot

data-pipelines's Issues

Setup linting and code formatting

  • Set up linting and code formatting according to current best practices, e.g. flake8 and black.
  • Set up CI for linting and formatting and configure it to fail on warnings. We want our code base to stay clean.

Add template pipeline and instructions

To make implementing new pipelines easy for everyone, we want

  • an example pipeline,
  • instructions in the Readme on how to adapt it for a given use case,
  • instructions on how to write unit and integration tests, and
  • instructions on who to ask for reviews.

Publish application docker image to registry to reduce build times

Currently, we rebuild our application image (containing Airflow, Poetry and all Python dependencies) every time the dependencies change. This takes a lot of time and slows down the dev process.

Proposed solution: upload the current image (reflecting the state of the application on the main branch) to a Docker registry and pull the image for further development from there, thus only installing incremental changes in dependencies.

Evaluate cuttle for converting jupyter notebooks to airflow pipelines

The docs say that this tool was made to let people convert their data wrangling in Jupyter notebooks into Airflow pipelines and jobs. As this is something most of our users might want to do, we will evaluate this option and, if possible, create a dummy workflow as a hands-on example.

Fix feature extraction pipeline

The symptoms question in the weekly survey has a new ID since ~2022/01/18. Because of this, the feature extraction pipeline does not gather answers for the new question, so symptoms are missing from the extracted features.

New question ID is 137.

We have to extract symptom data from both ID 86 (the old one) and ID 137 (the new one), without breaking backward compatibility with data that only contains answers from one of the questions.
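A minimal sketch of what the extraction could look like, assuming the answers live in a table like `datenspende.answers` with a `question` column (the table and column names are guesses, not the actual schema):

```python
# Hypothetical sketch: build one query that covers both the old and the
# new symptoms question ID. Table and column names (datenspende.answers,
# user_id, question, answer) are assumptions, not the actual schema.
OLD_SYMPTOMS_QUESTION_ID = 86
NEW_SYMPTOMS_QUESTION_ID = 137


def build_symptom_query(question_ids):
    """Return a parametrized SELECT matching any of the given question IDs."""
    placeholders = ", ".join(["%s"] * len(question_ids))
    return (
        "SELECT user_id, question, answer "
        "FROM datenspende.answers "
        f"WHERE question IN ({placeholders})"
    )


query = build_symptom_query([OLD_SYMPTOMS_QUESTION_ID, NEW_SYMPTOMS_QUESTION_ID])
# Execute with e.g. psycopg2: cursor.execute(query, (86, 137))
```

Passing both IDs into a single `IN` clause keeps old answers (ID 86) available, so extraction from only one question continues to work unchanged.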

Enable the vital data at epoch granularity

We currently do not receive/process the vital data at epoch granularity. This data holds tremendous value, and I want to re-add it to our database.

Steps

  • figure out which database technology to use (most likely Yandex ClickHouse, since this is what Thryve uses as well)
  • add schema and migration functionality as library to the repo
  • set up DAG for the data import

Fix continuous deployment

After restructuring the repo, the CD pipeline is broken. Fix it, and make sure that the next time it breaks, the pipeline also goes red.

Survey Data Postprocessing

For survey data from the "Tests and Symptoms" survey, we need a post-processed table that contains the following information:

  • user_id
  • question_id
  • question_encoding
  • answers (list of selected options in case of multiple choice)
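As a sketch of the intended shape, the post-processing could group raw answer rows per user and question and collect multiple-choice selections into a list. The input row format below is an assumption, not the real survey schema:

```python
# Hypothetical sketch of the post-processing: one output row per
# (user_id, question_id) with all selected options collected in a list.
from collections import defaultdict


def postprocess_answers(rows):
    grouped = defaultdict(list)
    encodings = {}
    for user_id, question_id, question_encoding, answer in rows:
        grouped[(user_id, question_id)].append(answer)
        encodings[(user_id, question_id)] = question_encoding
    return [
        {
            "user_id": user_id,
            "question_id": question_id,
            "question_encoding": encodings[(user_id, question_id)],
            "answers": answers,
        }
        for (user_id, question_id), answers in grouped.items()
    ]


rows = [
    (1, 10, "multiple_choice", "fever"),
    (1, 10, "multiple_choice", "cough"),
]
table = postprocess_answers(rows)
```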

Datenspende DAG schedules don't update

To resolve the DAG interdependency between the survey and the vital data DAGs in the data donation project, we have to change their schedules. However, the schedules in the production deployment don't update even though the DAG parameters have already been changed.

Setup database migrations

To have consistent database schemas in all environments, we want to have a single source of truth for them. This can be done via migrations that are tracked in git together with a migration manager that applies them to the databases that we use.

A good tooling option would be yoyo-migrations, as it allows migrations to be written in plain SQL as well as Python functions and works well with psycopg2.
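For illustration, a yoyo-migrations migration is a Python module with a module-level `steps` list, where each step pairs an apply statement with its rollback. The file name and table below are made up:

```python
# Hypothetical yoyo migration file, e.g. migrations/20220101_01_example.py.
# yoyo-migrations collects the module-level `steps` list; each step pairs
# an apply statement with its rollback. The table here is made up.
from yoyo import step

steps = [
    step(
        "CREATE TABLE example (id INTEGER PRIMARY KEY, name TEXT)",
        "DROP TABLE example",
    )
]
```

Because both apply and rollback are tracked in git, every environment can be brought to the same schema by running the migration manager against it.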

Fix docker builds

Docker builds are failing due to a missing signature on the package repository of one of our dependencies.

Add a script that computes 7-day rolling incidences

We usually need case numbers in (i) raw, (ii) 7d-averaged/summed, or (iii) 7d-averaged/summed per 100k form. I would like to have this provided as an extra table called incidences in our DB.

I've written the following script, which can be run whenever the case numbers are downloaded and processed. It uses polars v0.8.9 and computes incidences for every location on every level of abstraction that we have (4 levels):

  • level 0: corresponds to nuts0, i.e. country; in our case only Germany
  • level 1: corresponds to nuts1, German states
  • level 3: corresponds to nuts3, German Landkreise
  • level 4: corresponds to the "German id" we get from RKI; comprises Landkreise everywhere except for Berlin, where Berliner Bezirke are listed separately

These are the columns with their types (for creating the table in the DB):

[
    ('location_id', po.UInt16),
    ('location_level', po.UInt8),
    ('date_of_report', po.Date),
    ('new_cases', po.Int32),
    ('new_cases_last_7d', po.Int32),
    ('incidence_7d_per_100k', po.Float32),
    ('new_deaths', po.Int32),
    ('nuts3', str),
    ('population', po.UInt32),
    ('state', po.UInt8),
]
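The core of the computation can be illustrated in plain Python. The actual script uses polars, whose rolling-window API has changed across versions, so this is only the logic, not the implementation:

```python
# Plain-Python illustration of the 7-day rolling logic the script
# implements with polars; incidence is cases per 100k inhabitants.
def rolling_sum_7d(new_cases):
    """Sum of the current day and the six preceding days."""
    return [sum(new_cases[max(0, i - 6) : i + 1]) for i in range(len(new_cases))]


def incidence_7d_per_100k(new_cases_last_7d, population):
    return [100_000 * cases / population for cases in new_cases_last_7d]


daily = [10, 0, 5, 0, 0, 0, 0, 7]
weekly = rolling_sum_7d(daily)  # [10, 10, 15, 15, 15, 15, 15, 12]
```

This is run per location, once for each of the four location levels listed above.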

Add pipeline for Thryve user data

For the datenspende project, we need to collect user account data from Thryve as a prerequisite for dealing with vital and survey data.

Resolve vitaldata DAG interdependency

This is a workaround (#98) to resolve the issue of DAGs accessing the same table. In the not too distant future we probably want to restructure the datenspende DAGs.

Configure production deployment

For our production deployment, we need to use the actual production database of the group and probably set some other environment variables that differ from our CI and local testing environments.

Add questionnaire_session to homogenized features

In order to quickly pull additional information, the table datenspende_derivates.homogenized_features should contain a column with the questionnaire_session of the corresponding data in datenspende.answers.

Add predictions pipeline

Import predictions model and load baseline and user features data to make infection predictions. Save in predictions table.

Fix incidence calculation

Calculation of incidences is currently failing due to an integer overflow (see logs here).

This can probably be fixed either by changing the incidence calculation script or by changing the DB schema to BIGINT for the respective values. @benmaier can probably advise.
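If the schema route is taken, the change could ship as a migration. The table and column names below are taken from the incidence table sketched in the issue above and may not match the actual schema:

```python
# Hypothetical migration widening the overflowing column to BIGINT.
# Table/column names are guesses based on the incidence table sketch.
from yoyo import step

steps = [
    step(
        "ALTER TABLE incidences ALTER COLUMN new_cases_last_7d TYPE BIGINT",
        "ALTER TABLE incidences ALTER COLUMN new_cases_last_7d TYPE INTEGER",
    )
]
```

Note the rollback narrows the column back to INTEGER, which would fail if values larger than 2^31-1 have been written in the meantime.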

Add test type to homogenized features

Add the responses from question 91 to datenspende_derivates.homogenized_features. They contain information about the type (PCR, Antigen, Antibody) of the test taken.

The pipeline template/example should contain documentation in the form of comments

I'm starting to get a hold of how things work. Generally I prefer templates/examples to include inline comments wherever possible, so that every step is explained. Is that something that could be added?

From what I understand now, the following files are relevant for building my own pipeline:
1. dags/database/migrations/migration_files/20211014_99_move_test_table_to_test_shema.sql (for defining the structure of the table)
2. dags/csv_download_to_postgres/csv_download_to_postgres.py (for doing the actual work)
3. dags/csv_download_to_postgres/test_csv_download_to_postgres.py (for testing parts of the pipeline)
4. dags/csv_download_to_postgres/test_integration_csv_download_to_postgres.py (for testing the entire pipeline)

Which other files are necessary to start building a pipeline?

Originally posted by @marcwie in #65 (comment)

Add table with features containing reported tests and symptoms

To incorporate test results, symptoms and information about users' bodies (height, weight, sex etc.), we should add a table in datenspende_derivatives that contains that data.

I intend to add two separate tables for

  • the results from the "Tests and Symptoms" questionnaire and
  • the results from the weekly questionnaires.

Both should contain the following columns:

  • user_id
  • date (of test - since we only have the week, I'll probably take the first day of the week)
  • test result (boolean true/false/null if no test has been done that week)
  • data on user bodies (weight, height, age, sex)
  • data on reported symptoms

Possibly also vaccination status, but I haven't figured out yet where to get it, so this will probably move to a follow-up.

Setup branch protection and PR automation

We want to prohibit pushing to master and introduce new code via feature branches. We require passing CI (and potentially preview CD) as well as one approving review for merging. We want to have a bot that does the merging and cleanup if possible.

Add table for config files

Currently, the management system on Ava has a table that contains config files. These are strings in JSON format with a name like "SuchGreenAnt", which is the current config for the datenspende detections master.

I would like to add this functionality under the new framework, because storing and retrieving configs in a central place provides great value in my opinion.

I propose the following

  • adding a database schema in /migrations that sets up the table
  • adding functionality to /database that allows retrieval of json strings by their name
  • adding functionality to add new json strings to the table (this has some unclear aspects as of now)
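A rough sketch of the /database helpers this could add. The table name (`configs`) and its columns (`name`, `config`) are assumptions; the real schema would be defined in /migrations:

```python
# Hypothetical helpers for the proposed config table. The table name
# (configs) and its columns (name, config) are assumptions.
import json

GET_CONFIG_QUERY = "SELECT config FROM configs WHERE name = %s"
ADD_CONFIG_QUERY = "INSERT INTO configs (name, config) VALUES (%s, %s)"


def decode_config(raw: str) -> dict:
    """Decode the JSON string stored in the config table."""
    return json.loads(raw)


def encode_config(config: dict) -> str:
    """Serialize a config dict for storage in the table."""
    return json.dumps(config)


# Usage with e.g. psycopg2:
#   cursor.execute(GET_CONFIG_QUERY, ("SuchGreenAnt",))
#   config = decode_config(cursor.fetchone()[0])
```

Keeping the SQL in module-level constants makes the retrieval path trivially testable without a database connection.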

Fix poetry dependency installs during docker builds

Currently, during Docker builds, Poetry installs/updates all dependencies even though they are already present and their versions don't change. This makes changing dependencies and rebuilding the Docker stack slow and should be fixed.
