
rabbit-in-a-blender's Introduction

Rabbit in a Blender

Rabbit in a Blender (RiaB) is an ETL pipeline CLI to transform your EMR data to OMOP.

Why the name 'Rabbit in a Blender'? It stays in the rabbit theme of the OHDSI tools, and an ETL pipeline is like putting all your data in a blender.

No rabbits were harmed during the development of this tool!

Introduction

Extract-Transform-Load (ETL) processes are very complex and are mainly crafted by highly skilled data engineers. The process of transforming electronic medical record (EMR) data into the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) is no exception. The mapping of source values to standard concepts is mostly done by subject matter experts, who often lack the programming knowledge to build the ETL process. Wouldn't it be nice if we could drastically simplify the ETL process, so that you don't need seasoned data engineers to start the OMOP CDM journey? Imagine that you just save your queries, Usagi comma-separated value (CSV) text files and custom concept CSVs on disk, and run a command line interface (CLI) tool that does all the ETL magic automatically.

Currently, our ETL process supports BigQuery and SQL Server (both on-premises and in Azure) as backends. We're open to exploring additional database technologies to serve as backends for our ETL process.

Concept

The main strength of the CDM is its simplified schema. This schema is a relational data model, where each table has a primary key and can have foreign keys to other tables. Because of the relational data model, we can extract the table dependencies from the schema. For example, the provider table depends on the care_site table, which in turn depends on the location table. If we flatten that dependency graph, we get a sequence of ETL steps that we need to follow to have consistent data in our OMOP CDM.

These ETL steps can be automated, so a hospital can focus its resources on the queries and the mapping of the concepts. The automated ETL consists of multiple tasks. It needs to execute queries, add custom concepts, apply the Usagi source-to-concept mapping, and do a lot of housekeeping. An example of that housekeeping is the autonumbering of the OMOP CDM primary keys, for which the ETL process needs to maintain a swap table that holds the key of the source table and the generated sequential number of the CDM table's primary key. Other examples of housekeeping are the upload and processing of the Usagi CSVs and the upload and parsing of the custom concept CSVs.
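As a minimal sketch of the autonumbering idea (BigQuery-flavored SQL; the table and column names here are hypothetical, not RiaB's actual schema), such a swap table could look like this:

-- Hypothetical swap table: maps a source key to an auto-numbered CDM primary key.
CREATE TABLE IF NOT EXISTS work.person_id_swap (
    source_key STRING NOT NULL,  -- key of the row in the source (EMR) data
    person_id INT64 NOT NULL     -- generated sequential OMOP primary key
);

-- Autonumber the source keys that do not have a CDM primary key yet.
INSERT INTO work.person_id_swap (source_key, person_id)
SELECT
    src.source_key,
    (SELECT COALESCE(MAX(person_id), 0) FROM work.person_id_swap)
        + ROW_NUMBER() OVER (ORDER BY src.source_key)
FROM (SELECT DISTINCT source_key FROM raw.person_source) src
LEFT JOIN work.person_id_swap swap USING (source_key)
WHERE swap.source_key IS NULL;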

In an ETL process, data is divided into zones (cf. the zones in a data lake). The raw zone holds the source data (for example the data from the EMR), the work zone holds all the housekeeping tables of the ETL process, and the gold zone holds our final OMOP CDM.

After designing the architecture, the implementation needs to be developed. We have two design paradigms to choose from: configuration or convention. We chose convention over configuration, because it decreases the number of decisions the user has to make and eliminates complexity. As convention, a specific folder structure is adopted (see our mappings as an example). A folder is created for each OMOP CDM table, in which the SQL queries that fill up that CDM table are stored. In each table folder we also have a subfolder per concept column. Those concept column subfolders hold our Usagi CSVs (files ending with _usagi.csv). Each concept column subfolder also has a custom folder, which holds the custom concept CSVs (files ending with _concept.csv). With this convention in place, our ETL CLI tool has everything it needs to do its magic.
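For illustration, a mapping folder for the observation table could look like this (the file names are made up; only the folder structure and the _usagi.csv and _concept.csv suffixes follow the convention):

observation
  ├──lab_results.sql
  ├──observation_concept_id
  │   ├──lab_results_usagi.csv
  │   └──custom
  │     └──lab_results_concept.csv
  └──observation_type_concept_id
      └──observation_type_usagi.csv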

 


One final requirement we want to build into the ETL CLI tool is that each ETL step is an atomic operation: it either fails or succeeds, so that there is no possibility of corrupting the final OMOP CDM data.
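As a minimal sketch of that idea (BigQuery-flavored SQL; the table names are hypothetical, and this is not RiaB's actual implementation), an ETL step can be wrapped in a transaction:

-- Either both statements succeed or neither does, so the CDM table is never
-- left half-filled (assumes work.provider and omop.provider share a schema).
BEGIN TRANSACTION;
DELETE FROM omop.provider WHERE TRUE;
INSERT INTO omop.provider
SELECT * FROM work.provider;
COMMIT TRANSACTION;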

ETL flow

The ETL flow is like a two-stage rocket, with a first and a second stage.

First stage:

Most CDM tables have foreign keys (FKs) to other tables. Some tables can be processed in parallel by the ETL engine, because they have no FK dependencies between them; others have to be processed in a specific order.

The ETL flow for v5.4 is as follows:

└──vocabulary
  ├──cdm_source
  ├──metadata
  ├──cost
  ├──fact_relationship
  └──location
    └──care_site
      └──provider
        └──person
          ├──condition_era
          ├──death
          ├──dose_era
          ├──drug_era
          ├──episode
          ├──observation_period
          ├──payer_plan_period
          ├──specimen
          └──visit_occurrence
            ├──episode_event
            └──visit_detail
              ├──condition_occurrence
              ├──device_exposure
              ├──drug_exposure
              ├──measurement
              ├──note
              ├──observation
              └──procedure_occurrence
                └──note_nlp

Second stage: processes all the event foreign key columns (e.g. observation_event_id, cost_event_id, measurement_event_id, etc.). Those columns can point to almost any table, and the auto-generated _id's of those tables are only available after the first stage.
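As a hedged sketch of what such a second-stage update could look like (assuming a BigQuery backend and a hypothetical helper work table that links each observation to its event's source key; all names here are illustrative):

-- Hypothetical second-stage step: resolve observation_event_id now that the
-- first stage has generated the measurement table's primary keys.
UPDATE omop.observation o
SET observation_event_id = swap.measurement_id
FROM work.observation_event ev
JOIN work.measurement_id_swap swap
    ON swap.source_key = ev.event_source_key
WHERE o.observation_id = ev.observation_id;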

RiaB also performs like a rocket, especially in combination with BigQuery: a full ETL run resulting in 8 billion CDM records takes about 20 minutes.

Getting started

see getting started

CLI Usage

see CLI commands

Authors

License

Copyright © 2024, RADar-AZDelta. Released under the GNU General Public License v3.0.


rabbit-in-a-blender's People

Contributors

kdenturc, lbertelo01, pjlammertyn, stijndupulthys

rabbit-in-a-blender's Issues

Do not derive from CDM version 5.4

REVERT Deviations from v5.4

| Table | Change | Reason |
| --- | --- | --- |
| fact_relationship | Renamed domain_concept_id_1 and _2 to field_concept_id_1 and _2 | To stay consistent with similar fields like episode_event_field_concept_id. This follows Proposal 1 in the open issue #230. |
| cost | Changed cost_domain_id of type STRING to cost_field_concept_id of type INT64 | To stay consistent with similar fields like episode_event_field_concept_id. This follows the changes made in v6.0. |
| source_id_to_omop_id_map | Additional table | The table holds the mapping between the source id's and the generated OMOP id's. |

Update documentation --cleanup function

From the docs, running the --cleanup function will remove all data from OMOP tables.

As a user, I hesitated to run this command because I thought it might remove all uploaded vocabulary table data, which it doesn't.

Please update the description of the function.


RIAB: MEASUREMENT riab run successful but no measurement_concept_id's in OMOP measurement table or in measurement_id_swap in work table

@pjlammertyn riab --run-etl . -t measurement works and is successful; only labo.sql.jinja and labo_usagi.csv were tested.

But then in GCP there is no measurement_concept_id <> 0.

Although in the work zone we see a good upload.

The measurement concept_id Usagi mappings are appearing.

But the measurement_id_swap is not giving measurement_concept_id's. How come?

@sieldep @lbertelo01 please check whether this appears for you too.

Check for non-standard codes in usagi files

Would it be possible to add a check in riab, so that it gives an error or warning when non-standard concepts are chosen?

We noticed that by updating the vocabularies to the most recent version, some concepts were changed to non-standard. It would be useful if riab checked for these codes.
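A hedged sketch of such a check (the Usagi work table name is assumed, following the work-table naming seen elsewhere in this repo; sourceCode and conceptId are standard Usagi CSV columns):

-- Hypothetical check: list Usagi mappings whose target concept is not standard.
SELECT u.sourceCode, u.conceptId, c.concept_name, c.standard_concept
FROM `work.measurement__measurement_concept_id_usagi` u
INNER JOIN `omop.concept` c
    ON c.concept_id = u.conceptId
WHERE c.standard_concept IS NULL OR c.standard_concept <> 'S'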

RIAB: Performance RiAB & cost

@pjlammertyn running RiaB on the measurement table alone took approximately 15 to 20 minutes, and at the end the cost went up to €10, so today I've spent over €100.
How can we optimize?

If I take the labo_usagi.csv file and reduce the rows from 6,500 to 70, the runtime drops to 3 minutes and the cost to €0.91, which is much nicer to work with.

Would it be an idea, for performance and cost, to delete the unchecked and flagged mappings? How can we do so?
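One hedged possibility, assuming the Usagi CSV's mappingStatus column is loaded as-is into the work table (the table name is illustrative):

-- Hypothetical cleanup: keep only approved mappings before the ETL uses them.
DELETE FROM `work.measurement__measurement_concept_id_usagi`
WHERE mappingStatus IN ('UNCHECKED', 'FLAGGED')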

tx!
please follow @sieldep

Add check for duplicate custom concepts

Example query for the check:

-- Find custom concept codes that receive more than one concept_id in the
-- swap table, i.e. duplicate custom concepts.
SELECT t.concept_code, COUNT(*) AS duplicate_count
FROM (
    -- distinct (generated concept_id, custom concept) combinations
    SELECT DISTINCT swap.y AS concept_id, c.* EXCEPT (concept_id)
    FROM `work.observation__observation_concept_id_concept` c
    INNER JOIN `work.concept_id_swap` swap
        ON swap.x = CONCAT('observation_concept_id__', c.concept_code)
) t
GROUP BY t.concept_code
HAVING COUNT(*) > 1

SQL Server: Truncation error with multiple mappings to same concept

Mapping multiple source_values to the same OMOP concept_id can result in a data truncation error in SQL Server.

e.g.:

Mapping multiple measurements to the same concept results in the error:
"String or binary data would be truncated in table 'FHIN.work.measurement', column 'measurement_source_value'."

The 'measurement_source_value' column in the work-zone table is filled with STRING_AGG(t.measurement_source_value, ', '). The size of the 'measurement_source_value' column is limited, and aggregating multiple source values exceeds this limit.
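A hedged sketch of one possible mitigation (T-SQL; the table name is illustrative): cap the aggregate at the column length so the insert cannot overflow.

-- Hypothetical mitigation: truncate the aggregated source values to 50 characters.
SELECT LEFT(STRING_AGG(t.measurement_source_value, ', '), 50) AS measurement_source_value
FROM work.measurement t;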

issue with deleting tables

When running riab --cleanup or riab --import-vocabularies ..., tables get deleted, but there seems to be a bug in deleting a table.


Deviation from OMOP CDM 5.4

Request to change the length of the source_value columns from 50 characters to 255, deviating from OMOP CDM 5.4.

As a consequence, working tables should be able to represent primary key and concept_id mappings of length > 50.

RIAB: invalid concept domains condition_occurrence

After running the newest version of RiaB on the rebased dev branch with the command riab --run-etl . -t condition_occurrence, an error about invalid concept domains appeared. After checking the mzg_auto_usagi.csv file, no domain other than condition was appearing, so this is strange.

Could it be a capitalization issue, Condition instead of condition?

@lbertelo01 @sieldep thanks for following up.

Suggestion for small change in table order of ETL


The vocabulary table should be one of the first tables to run, because if a table uses custom concepts, the ETL checks whether the vocabulary used (in our case AZDELTA) is present in the vocabulary table. With the order used now, we get a problem with the custom concepts in the fact_relationship table.
