
etl-lambdabuilder's Introduction

ETL-LambdaBuilder

CDM Builder leveraging AWS Lambda

Getting Started

The ETL-LambdaBuilder consists of two AWS Lambda functions and the ETL command line tool:

  • the ETL command line tool - splits the source dataset (Redshift) into chunks, moves the data to an S3 bucket, and triggers the AWS Lambda functions
  • CDM Builder function - converts native data from the S3 bucket to the CDM 5.4 format and stores the result in the S3 result bucket as .csv files
  • Merge function - aggregates the following tables: fact_relationship, metadata.

Prerequisites:

  • Visual Studio 2022
  • AWS Toolkit for Visual Studio
  • .NET 8.0

Publish to Lambda

  1. Open org.ohdsi.cdm.sln with Visual Studio
  2. In AWS Explorer, set up your AWS profile (Menu: View -> AWS Explorer). If AWS Explorer is not available, install the AWS Toolkit.


  3. In Solution Explorer, right-click the org.ohdsi.cdm.presentation.lambdabuilder project and click Publish to AWS Lambda...


  4. Use the following settings:
  • Runtime: .NET 8
  • Architecture: ARM
  • Function name: CDMBuilder
  • Handler: org.ohdsi.cdm.presentation.lambdabuilder::org.ohdsi.cdm.presentation.lambdabuilder.Function::FunctionHandler


Publish the Merge function (the org.ohdsi.cdm.presentation.lambdamerge project) in the same way, using the following settings:

  • Runtime: .NET 8
  • Architecture: ARM
  • Function name: Merge
  • Handler: org.ohdsi.cdm.presentation.lambdamerge::org.ohdsi.cdm.presentation.lambdamerge.Function::FunctionHandler
  5. Upload the functions to AWS
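
The handler string has the form assembly::namespace.class::method. As a rough sketch, assuming the standard AWS Lambda .NET programming model with an S3 event payload (the actual function in this project may use a different payload type):

using System.Threading.Tasks;
using Amazon.Lambda.Core;
using Amazon.Lambda.S3Events;

namespace org.ohdsi.cdm.presentation.lambdabuilder
{
    public class Function
    {
        // "org.ohdsi.cdm.presentation.lambdabuilder::org.ohdsi.cdm.presentation.lambdabuilder.Function::FunctionHandler"
        // resolves to this method when the Lambda runtime loads the assembly.
        public async Task FunctionHandler(S3Event s3Event, ILambdaContext context)
        {
            // convert one chunk of native data to CDM 5.4 and write .csv results to S3
            await Task.CompletedTask;
        }
    }
}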

Configuring CDMBuilder Lambda function

  1. Open the AWS Console, Lambda functions page
  2. Open the CDMBuilder function
  3. Add an S3 trigger:
  • Bucket: use a dedicated trigger bucket; every PUT event in this bucket will invoke the function
  • Event types: s3:ObjectCreated:Put
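
Instead of the console, the same trigger can be added with the AWS SDK for .NET; a minimal sketch (the bucket name and function ARN are placeholders, and S3 must first be granted permission to invoke the function via Lambda AddPermission with principal s3.amazonaws.com):

using System.Collections.Generic;
using Amazon.S3;
using Amazon.S3.Model;

var s3 = new AmazonS3Client();

await s3.PutBucketNotificationAsync(new PutBucketNotificationRequest
{
    BucketName = "your-trigger-bucket",
    LambdaFunctionConfigurations = new List<LambdaFunctionConfiguration>
    {
        new LambdaFunctionConfiguration
        {
            // ARN of the CDMBuilder function published above (placeholder value)
            FunctionArn = "arn:aws:lambda:us-east-1:123456789012:function:CDMBuilder",
            Events = new List<EventType> { EventType.ObjectCreatedPut }
        }
    }
});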


  4. Set up environment variables:
  • Bucket: the result bucket
  • CDMFolder: cdmCSV
  • S3AwsAccessKeyId: AccessKeyId for the result bucket
  • S3AwsSecretAccessKey: SecretAccessKey for the result bucket
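
Inside the function these variables are read from the environment; a minimal sketch (the builder's own code may wrap this differently):

using System;

string bucket          = Environment.GetEnvironmentVariable("Bucket");
string cdmFolder       = Environment.GetEnvironmentVariable("CDMFolder");
string accessKeyId     = Environment.GetEnvironmentVariable("S3AwsAccessKeyId");
string secretAccessKey = Environment.GetEnvironmentVariable("S3AwsSecretAccessKey");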


  5. Set up the general configuration. The optimal values depend on the source dataset; recommended settings:
  • Memory: 3000 MB
  • Ephemeral storage: 2048 MB
  • Timeout: 15 min
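
The same settings can also be applied with the AWS SDK for .NET; an illustrative sketch (the console works just as well):

using Amazon.Lambda;
using Amazon.Lambda.Model;

var lambda = new AmazonLambdaClient();

await lambda.UpdateFunctionConfigurationAsync(new UpdateFunctionConfigurationRequest
{
    FunctionName = "CDMBuilder",
    MemorySize = 3000,                                        // MB
    Timeout = 900,                                            // seconds (15 minutes)
    EphemeralStorage = new EphemeralStorage { Size = 2048 }   // MB
});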


Configuring Merge Lambda function

  1. Open Merge function

  2. Add s3 trigger

  • Bucket: use the CDMBuilder trigger bucket (again, every PUT event in this bucket will invoke the function)
  • Event types: s3:ObjectCreated:Put
  • Prefix: merge/
  3. Set up environment variables:
  • Bucket: the result bucket
  • CDMFolder: cdmCSV
  • ResultFolder: cdmCSV
  • S3AwsAccessKeyId: AccessKeyId for result bucket
  • S3AwsSecretAccessKey: SecretAccessKey for result bucket
  4. Set up the general configuration. The optimal values depend on the source dataset; recommended settings:
  • Memory: 3000 MB
  • Ephemeral storage: 512 MB
  • Timeout: 15 min

Publish command line tool

  1. In Solution Explorer, right-click the org.ohdsi.cdm.presentation.etl project and click Publish...


  2. Choose the publish settings; any reasonable configuration can be used.


Run ETL conversion

  1. Open the command line tool folder
  2. Update the appsettings.json file with your settings:
{
  "AppSettings": {
    "s3_aws_access_key_id": "your result bucket access_key_id",
    "s3_aws_secret_access_key": "your result bucket secret_access_key",
    "bucket": "your result bucket",

    "s3_messages_access_key_id": "your trigger bucket access_key_id",
    "s3_messages_secret_access_key": "your trigger bucket secret_access_key",
    "messages_bucket": "your trigger bucket",
    "messages_bucket_merge": "your trigger bucket/merge",

    "raw_folder": "raw",
    "cdm_folder_csv": "cdmCSV",
    "cdm_folder": "cdmCSV",
    "vendor_settings": "",

    "spectrum_cluster": "",
    "spectrum_db": "",

    "iam_role": "",

    "parallel_queries": "1",
    "parallel_chunks": "5",

    "local_path": ""
  },
  "ConnectionStrings": {
    "Builder": "Data Source=builder.db",
    "Vocabulary": "Connection string to database that contains OHDSI Vocabulary tables (concept, concept_ancestor, source_to_concept_map...)",
    "Source": "Connection string to source database"
  }
}
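
For reference, a minimal sketch of reading this file with Microsoft.Extensions.Configuration (the ETL tool's own loading code may differ):

using Microsoft.Extensions.Configuration;

var config = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json", optional: false)
    .Build();

// individual AppSettings values are addressed with "section:key" paths
string resultBucket     = config["AppSettings:bucket"];
string cdmFolder        = config["AppSettings:cdm_folder_csv"];
string sourceConnection = config.GetConnectionString("Source");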

Connection string example:

Driver={Amazon Redshift (x64)};Server=your server name;Database={db};Sc={sc};UID=your user;PWD=your pswd;Port=5439;SSL=true;Sslmode=Require;UseDeclareFetch=1;Fetch=10000000;UseMultipleStatements=1

The {db} and {sc} placeholders are replaced with command line parameter values when the tool starts.
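
A hypothetical illustration of that substitution (the actual replacement happens inside the tool; mapping {db} to --rawdb and {sc} to --rawschema is an assumption based on the parameter descriptions below):

// build the final connection string from the template in appsettings.json
static string BuildConnectionString(string template, string rawdb, string rawschema) =>
    template
        .Replace("{db}", rawdb)        // --rawdb command line parameter
        .Replace("{sc}", rawschema);   // --rawschema command line parameter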

  3. Run org.ohdsi.cdm.presentation.etl with the following parameters:
  • vendor - name of the ETL converter (available converters: ccae, mdcr, mdcd, jmdc, cprd, premier, dod, ses, panther)
  • rawdb - name of the source database
  • rawschema - name of the source schema
  • batch - chunk size in number of persons; the tool splits the native data into chunks of this size (details below)
  • new - true/false, default true; create a new folder for the conversion result
  • skip_chunk - true/false, default false; used in resume mode to skip (or not) the chunk creation step
  • resume - true/false, default false; in resume mode the tool converts only the chunks that were not completed
  • skip_lookup - true/false, default false; the lookup step creates the following tables: Provider, Location, Care_site
  • skip_build - true/false, default false; true is for testing purposes only
  • skip_etl - true/false, default false; true is for testing purposes only
  • versionid - CDM version number, optional parameter
  • skip_vocabulary - true/false, default true; false copies the vocabulary tables to the result bucket in .csv format
  • skip_cdmsource - true/false, default false; controls creation of the cdm_source data (with vendor details) in the S3 result bucket
  • skip_merge - true/false, default false; controls aggregation of the fact_relationship and metadata tables
  • skip_validation - obsolete parameter, use true

Example:

org.ohdsi.cdm.presentation.etl.exe --vendor ccae --rawdb ccae  --rawschema native --batch 500000 --new true --skip_chunk false --resume false --skip_lookup false  --skip_build false --skip_etl false --versionid 1 --skip_vocabulary true --skip_cdmsource false --skip_merge false --skip_validation true

Batch size (size of the chunk)

Lambda functions are limited in memory, so the ETL tool splits the source data into chunks; each chunk is additionally divided into smaller parts according to the number of slices in the source Redshift cluster.

To check the number of slices, use the following query:

select node, slice from stv_slices;

The chunk size depends on the source dataset and the number of slices. A larger chunk gives better performance, but it can cause an out-of-memory error in the Lambda function; in that case either reduce the chunk size or increase the Lambda memory.

The approximate chunk size (for a 3000 MB CDMBuilder function) can be calculated with this formula:

batch=number of slice * 100k

A couple of examples of conversion time depending on cluster type for IBM CCAE:

  • ra3.16xlarge, 12 nodes, 24 slices - chunk size = 250000, conversion time approximately 30h
  • ra3.4xlarge, 6 nodes, 192 slices - chunk size = 2000000, conversion time approximately 4h

ETL tool parameters example:

org.ohdsi.cdm.presentation.etl.exe --vendor ccae --rawdb ibm_ccae --rawschema native --batch 500000 --new true --skip_chunk false --resume false --skip_lookup false  --skip_build false --skip_etl false --versionid 1 --skip_vocabulary true --skip_cdmsource false --skip_merge false --skip_validation true

Lambda function log output

When the chunk data is available on S3, the ETL tool triggers the functions by creating N files in the S3 trigger bucket; the number of functions/files equals the number of slices. The ETL output shows messages like the following:

...
[Moving raw data] Lambda functions for cuhnkId=21 were triggered | 24 functions
...

You can check the log of each Lambda function in Amazon CloudWatch; the CDMBuilder and Merge functions each have their own log group. The file that triggered the Lambda is automatically deleted if the conversion was successful.

The ETL tool output also reports the total number of chunks, for example:

...
***** Chunk ids were generated and saved, total count=36 *****
...

and current progress, for example:

*** 6:17 AM | Running: lambda 7 local 0; Validating 0; Valid 16; Timeout 0; Invalid 0; Error 0; Total 23

Export CDM from s3 to Redshift

  1. Create the CDM database DDL
  2. Use the following template to move data from S3 to a Redshift table:
copy "schema_name"."cdm_table" 
from 's3://your_result_bucket/vendor_name/conversionId/cdmCSV/cdm_table/cdm_table.' 
credentials 'aws_access_key_id=your_result_bucket_access_key_id;aws_secret_access_key=your_result_bucket_secret_access_key' DELIMITER ',' CSV QUOTE AS '"' GZIP
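
To load every CDM table this way, the COPY statement can be issued per table over ODBC; a hypothetical helper (the schema, table names, S3 prefix, and credentials are illustrative only):

using System.Data.Odbc;

static void CopyTable(OdbcConnection connection, string schema, string table,
                      string s3Prefix, string credentials)
{
    // same COPY template as above, filled in for one CDM table
    var sql =
        $"copy \"{schema}\".\"{table}\" " +
        $"from '{s3Prefix}/{table}/{table}.' " +
        $"credentials '{credentials}' DELIMITER ',' CSV QUOTE AS '\"' GZIP";

    using var command = new OdbcCommand(sql, connection);
    command.ExecuteNonQuery();
}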

Important

  • Be careful with the trigger bucket: all PUT events in this bucket will invoke the Lambda function
  • In the current implementation, all native tables involved in the transformation must have a distribution key on the person id column (ENROLID for IBM datasets, MEDREC_KEY for Premier, PATID for CPRD, etc.)

etl-lambdabuilder's People

Contributors

adelmestri, angujo, anthonymolinaro, bradanton, burrowse, clairblacketer, dependabot[bot], dimshitc, ericavoss, fdefalco, ganisimov, navirter


etl-lambdabuilder's Issues

Cannot run this app locally after it compiled successfully

I successfully compiled the project and from the

C:\Users\josecz\source\repos\ETL-LambdaBuilder\sources\Presentation\org.ohdsi.cdm.presentation.buildingmanager2\bin\Debug

folder I ran

.\org.ohdsi.cdm.presentation.buildingmanager2.exe

After entering the required db values I started the building process, but I got the following error message:

Builder: HSC-CTSC-ETL
Time: Nov 12 2020 10:49AM
Error: Exception Found:
Type: System.ComponentModel.Win32Exception
Message: The system cannot find the file specified
Source: System
Stacktrace: at System.Diagnostics.Process.StartWithShellExecuteEx(ProcessStartInfo startInfo)
at System.Diagnostics.Process.Start()
at System.Diagnostics.Process.Start(ProcessStartInfo startInfo)
at org.ohdsi.cdm.framework.desktop.Controllers.BuilderController.<>c__DisplayClass27_0.b__0() in C:\Users\josecz\source\repos\ETL-LambdaBuilder\sources\Framework\org.ohdsi.cdm.framework.desktop\Controllers\BuilderController.cs:line 211
at org.ohdsi.cdm.framework.desktop.Controllers.BuilderController.PerformAction(Action act) in C:\Users\josecz\source\repos\ETL-LambdaBuilder\sources\Framework\org.ohdsi.cdm.framework.desktop\Controllers\BuilderController.cs:line 68

Does anyone know how to solve this error?

Thanks

Steps to run out of AWS

I would like (and need) to run this ETL-LambdaBuilder app outside of AWS. Can you please suggest the steps to do so?
Thanks
Jose

Ethnicity in OptumSES should be corrected to be 'unknown'=0 if not Hispanic

Optum SES provides a single 'RACE_CODE' field which combines race and ethnicity information. If this value is White, Black, or Asian, then RACE_CONCEPT_ID should be populated accordingly and ETHNICITY_CONCEPT_ID set to 0. If the RACE_CODE value is 'Hispanic', then RACE_CONCEPT_ID = 0 and ETHNICITY_CONCEPT_ID = 38003562 (as done now). The change: we should not set ethnicity = 'Not Hispanic or Latino' when White/Black/Asian values are provided.
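
A sketch of the proposed mapping (not the current builder logic; the literal RACE_CODE values and the standard race concept ids 8527/8516/8515 are assumptions for illustration):

static (int raceConceptId, int ethnicityConceptId) MapRaceCode(string raceCode) =>
    raceCode switch
    {
        "White"    => (8527, 0),        // race populated, ethnicity unknown
        "Black"    => (8516, 0),
        "Asian"    => (8515, 0),
        "Hispanic" => (0, 38003562),    // ethnicity id cited in this issue
        _          => (0, 0)
    };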

Invalid column name 'payersource'...

Hello, I am trying to upgrade to this from the old ETL-CDMBuilder since I need v5.3. I am able to run the old software on my current data. However, when I try to run this software I receive an error during the 'conversion to cdm' step. It has type 'System.Data.SqlClient.SqlException' and messages such as 'Invalid column name 'payersource'', 'Invalid column name 'observation_concept_id'', etc. I searched through all the files in the software and found that this error is likely coming from one of two files: either ./executable/Core/Transformation/Truven/Definitons/ENROLLMENT_DETAIL.xml or ./sources/Framework/org.ohdsi.cdm.framework.common2/Core/Transformation/Truven/Definitions/ENROLLMENT_DETAIL.xml. These files contain queries that seem to imply that my enrollment detail table should have payersource, observation_concept_id, etc. columns, which it does not have. I have checked the source sas7bdat Truven files and these columns are not present in the enrollment detail file either. Am I missing a pre-processing step? Any ideas about how to fix this issue? Let me know if you need any more details!

Lookup and Lookuptype tables

While creating the test and additional tables, the CPRD_Test_Setup.sql and CPRD_Additional_Setup.sql queries are being run; they reference lookup and lookuptype tables on which joins are happening.

How are these tables being created, and how are they getting populated?

eGFR (Estimated glomerular filtration rate) mapped to Biomarker EGFR

Hello,

It looks like the lab test for eGFR (note the capitalisation - expanded to Estimated glomerular filtration rate) is mapped to Epidermal Growth Factor Receptor (EGFR; concept_id 3040084) in the lab names file (https://github.com/OHDSI/ETL-LambdaBuilder/blob/master/docs/OPTUM_PANTHER/Vocab%20Updates/JNJ_OPTUM_EHR_LABNAM.txt) as well as the NLPM file (https://github.com/OHDSI/ETL-LambdaBuilder/blob/master/docs/OPTUM_PANTHER/Vocab%20Updates/JNJ_OPTUM_EHR_NLPM.txt). This presents a problem of overwriting, and all eGFR tests need a separate one.

There is also a problem that multiple different Estimated glomerular filtration rates exist (MDRD, Mayo, African-American, non-African-American), and while the non-African-American one may be mapped correctly (concept: 3049187) in the NLPM file, all the other ones are mapped to the wrong concept as above.
Additionally, they also need to be correctly moved to the Labs section.
We can hopefully resolve these concept_ids.

Regards
Aditya

CPRD STEM Logic

While creating the STEM table from clinical, referral, immunisation, test, additional, and therapy, the logic for a few columns present in test, additional, and therapy is not available for the clinical, referral, and immunisation stem tables (for example: operator_concept_id, range_high, range_low, unit_concept_id, unit_source_value, start_date, end_date, value_as_number, value_as_string, value_as_concept_id, value_source_value).

Is there any join logic for unifying all the tables into a single STEM table?
