
etl-lambdabuilder's Introduction

ETL-LambdaBuilder

CDM Builder leveraging AWS Lambda

Getting Started

The ETL-LambdaBuilder consists of two AWS Lambda functions and the ETL command line tool:

  • the ETL command line tool - splits the source dataset (Redshift) into chunks, moves the data to an S3 bucket, and triggers the AWS Lambda functions
  • CDM Builder function - converts native data from the S3 bucket to the CDM 5.4 format and stores the result in the S3 result bucket as .csv files
  • Merge function - aggregates the following tables: fact_relationship, metadata.

Prerequisites:

  • Visual Studio 2022
  • AWS Toolkit for Visual Studio
  • .NET 8.0

Publish to Lambda

  1. Open org.ohdsi.cdm.sln with Visual Studio
  2. In AWS Explorer, set up your AWS profile (Menu: View -> AWS Explorer). If AWS Explorer is not available, install the AWS Toolkit.


  3. In Solution Explorer, right-click the org.ohdsi.cdm.presentation.lambdabuilder project and click Publish to AWS Lambda...


  4. Use the following settings:
  • Runtime: .NET 8
  • Architecture: ARM
  • Function name: CDMBuilder
  • Handler: org.ohdsi.cdm.presentation.lambdabuilder::org.ohdsi.cdm.presentation.lambdabuilder.Function::FunctionHandler


Publish the Merge function (the org.ohdsi.cdm.presentation.lambdamerge project) in the same way, using the following settings:

  • Runtime: .NET 8
  • Architecture: ARM
  • Function name: Merge
  • Handler: org.ohdsi.cdm.presentation.lambdamerge::org.ohdsi.cdm.presentation.lambdamerge.Function::FunctionHandler
  5. Upload the functions to AWS
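
The handler string has the form assembly::namespace.class::method. As a rough sketch, assuming the standard AWS Lambda .NET programming model with an S3 event payload (the actual function in this project may use a different payload type):

using System.Threading.Tasks;
using Amazon.Lambda.Core;
using Amazon.Lambda.S3Events;

namespace org.ohdsi.cdm.presentation.lambdabuilder
{
    public class Function
    {
        // "org.ohdsi.cdm.presentation.lambdabuilder::org.ohdsi.cdm.presentation.lambdabuilder.Function::FunctionHandler"
        // resolves to this method when the Lambda runtime loads the assembly.
        public async Task FunctionHandler(S3Event s3Event, ILambdaContext context)
        {
            // convert one chunk of native data to CDM 5.4 and write .csv results to S3
            await Task.CompletedTask;
        }
    }
}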

Configuring CDMBuilder Lambda function

  1. Open the AWS Console, Lambda functions page
  2. Open the CDMBuilder function
  3. Add an S3 trigger:
  • Bucket: use a dedicated trigger bucket; every PUT event in this bucket will invoke the function
  • Event types: s3:ObjectCreated:Put
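
Instead of the console, the same trigger can be added with the AWS SDK for .NET; a minimal sketch (the bucket name and function ARN are placeholders, and S3 must first be granted permission to invoke the function via Lambda AddPermission with principal s3.amazonaws.com):

using System.Collections.Generic;
using Amazon.S3;
using Amazon.S3.Model;

var s3 = new AmazonS3Client();

await s3.PutBucketNotificationAsync(new PutBucketNotificationRequest
{
    BucketName = "your-trigger-bucket",
    LambdaFunctionConfigurations = new List<LambdaFunctionConfiguration>
    {
        new LambdaFunctionConfiguration
        {
            // ARN of the CDMBuilder function published above (placeholder value)
            FunctionArn = "arn:aws:lambda:us-east-1:123456789012:function:CDMBuilder",
            Events = new List<EventType> { EventType.ObjectCreatedPut }
        }
    }
});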


  4. Set up environment variables:
  • Bucket: the result bucket
  • CDMFolder: cdmCSV
  • S3AwsAccessKeyId: AccessKeyId for the result bucket
  • S3AwsSecretAccessKey: SecretAccessKey for the result bucket
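
Inside the function these variables are read from the environment; a minimal sketch (the builder's own code may wrap this differently):

using System;

string bucket          = Environment.GetEnvironmentVariable("Bucket");
string cdmFolder       = Environment.GetEnvironmentVariable("CDMFolder");
string accessKeyId     = Environment.GetEnvironmentVariable("S3AwsAccessKeyId");
string secretAccessKey = Environment.GetEnvironmentVariable("S3AwsSecretAccessKey");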


  5. Set up the general configuration. The optimal values depend on the source dataset; recommended settings:
  • Memory: 3000 MB
  • Ephemeral storage: 2048 MB
  • Timeout: 15 min
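
The same settings can also be applied with the AWS SDK for .NET; an illustrative sketch (the console works just as well):

using Amazon.Lambda;
using Amazon.Lambda.Model;

var lambda = new AmazonLambdaClient();

await lambda.UpdateFunctionConfigurationAsync(new UpdateFunctionConfigurationRequest
{
    FunctionName = "CDMBuilder",
    MemorySize = 3000,                                        // MB
    Timeout = 900,                                            // seconds (15 minutes)
    EphemeralStorage = new EphemeralStorage { Size = 2048 }   // MB
});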


Configuring Merge Lambda function

  1. Open Merge function

  2. Add s3 trigger

  • Bucket: use the CDMBuilder trigger bucket (again, every PUT event in this bucket will invoke the function)
  • Event types: s3:ObjectCreated:Put
  • Prefix: merge/
  3. Set up environment variables:
  • Bucket: the result bucket
  • CDMFolder: cdmCSV
  • ResultFolder: cdmCSV
  • S3AwsAccessKeyId: AccessKeyId for result bucket
  • S3AwsSecretAccessKey: SecretAccessKey for result bucket
  4. Set up the general configuration. The optimal values depend on the source dataset; recommended settings:
  • Memory: 3000 MB
  • Ephemeral storage: 512 MB
  • Timeout: 15 min

Publish command line tool

  1. In Solution Explorer, right-click the org.ohdsi.cdm.presentation.etl project and click Publish...


  2. Choose the publish settings; any reasonable configuration can be used.


Run ETL conversion

  1. Open the command line tool folder
  2. Update the appsettings.json file with your settings:
{
  "AppSettings": {
    "s3_aws_access_key_id": "your result bucket access_key_id",
    "s3_aws_secret_access_key": "your result bucket secret_access_key",
    "bucket": "your result bucket",

    "s3_messages_access_key_id": "your trigger bucket access_key_id",
    "s3_messages_secret_access_key": "your trigger bucket secret_access_key",
    "messages_bucket": "your trigger bucket",
    "messages_bucket_merge": "your trigger bucket/merge",

    "raw_folder": "raw",
    "cdm_folder_csv": "cdmCSV",
    "cdm_folder": "cdmCSV",
    "vendor_settings": "",

    "spectrum_cluster": "",
    "spectrum_db": "",

    "iam_role": "",

    "parallel_queries": "1",
    "parallel_chunks": "5",

    "local_path": ""
  },
  "ConnectionStrings": {
    "Builder": "Data Source=builder.db",
    "Vocabulary": "Connection string to database that contains OHDSI Vocabulary tables (concept, concept_ancestor, source_to_concept_map...)",
    "Source": "Connection string to source database"
  }
}
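
For reference, a minimal sketch of reading this file with Microsoft.Extensions.Configuration (the ETL tool's own loading code may differ):

using Microsoft.Extensions.Configuration;

var config = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json", optional: false)
    .Build();

// individual AppSettings values are addressed with "section:key" paths
string resultBucket     = config["AppSettings:bucket"];
string cdmFolder        = config["AppSettings:cdm_folder_csv"];
string sourceConnection = config.GetConnectionString("Source");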

Connection string example:

Driver={Amazon Redshift (x64)};Server=your server name;Database={db};Sc={sc};UID=your user;PWD=your pswd;Port=5439;SSL=true;Sslmode=Require;UseDeclareFetch=1;Fetch=10000000;UseMultipleStatements=1

The {db} and {sc} placeholders are replaced with command line parameter values when the tool starts.
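
A hypothetical illustration of that substitution (the actual replacement happens inside the tool; mapping {db} to --rawdb and {sc} to --rawschema is an assumption based on the parameter descriptions below):

// build the final connection string from the template in appsettings.json
static string BuildConnectionString(string template, string rawdb, string rawschema) =>
    template
        .Replace("{db}", rawdb)        // --rawdb command line parameter
        .Replace("{sc}", rawschema);   // --rawschema command line parameter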

  3. Run org.ohdsi.cdm.presentation.etl with the following parameters:
  • vendor - name of the ETL converter (available converters: ccae, mdcr, mdcd, jmdc, cprd, premier, dod, ses, panther)
  • rawdb - name of the source database
  • rawschema - name of the source schema
  • batch - chunk size in number of persons; the tool splits the native data into chunks of this size (details below)
  • new - true/false, default true; create a new folder for the conversion result
  • skip_chunk - true/false, default false; used in resume mode to skip (or not) the chunk creation step
  • resume - true/false, default false; in resume mode the tool converts only the chunks that were not completed
  • skip_lookup - true/false, default false; the lookup step creates the following tables: Provider, Location, Care_site
  • skip_build - true/false, default false; true is for testing purposes only
  • skip_etl - true/false, default false; true is for testing purposes only
  • versionid - CDM version number, optional parameter
  • skip_vocabulary - true/false, default true; false copies the vocabulary tables to the result bucket in .csv format
  • skip_cdmsource - true/false, default false; controls creation of the cdm_source data (with vendor details) in the S3 result bucket
  • skip_merge - true/false, default false; controls aggregation of the fact_relationship and metadata tables
  • skip_validation - obsolete parameter, use true

Example:

org.ohdsi.cdm.presentation.etl.exe --vendor ccae --rawdb ccae  --rawschema native --batch 500000 --new true --skip_chunk false --resume false --skip_lookup false  --skip_build false --skip_etl false --versionid 1 --skip_vocabulary true --skip_cdmsource false --skip_merge false --skip_validation true

Batch size (size of the chunk)

Lambda functions are limited in memory, so the ETL tool splits the source data into chunks; each chunk is additionally divided into smaller parts according to the number of slices in the source Redshift cluster.

To check the number of slices, use the following query:

select node, slice from stv_slices;

The chunk size depends on the source dataset and the number of slices. A larger chunk gives better performance, but it can cause an out-of-memory error in the Lambda function; in that case either reduce the chunk size or increase the Lambda memory.

The approximate chunk size (for a 3000 MB CDMBuilder function) can be calculated with this formula:

batch=number of slice * 100k

A couple of examples of conversion time depending on cluster type for IBM CCAE:

  • ra3.16xlarge, 12 nodes, 24 slices - chunk size = 250000, conversion time approximately 30h
  • ra3.4xlarge, 6 nodes, 192 slices - chunk size = 2000000, conversion time approximately 4h

ETL tool parameters example:

org.ohdsi.cdm.presentation.etl.exe --vendor ccae --rawdb ibm_ccae --rawschema native --batch 500000 --new true --skip_chunk false --resume false --skip_lookup false  --skip_build false --skip_etl false --versionid 1 --skip_vocabulary true --skip_cdmsource false --skip_merge false --skip_validation true

Lambda function log output

When the chunk data is available on S3, the ETL tool triggers the functions by creating N files in the S3 trigger bucket; the number of functions/files equals the number of slices. The ETL output shows messages like the following:

...
[Moving raw data] Lambda functions for cuhnkId=21 were triggered | 24 functions
...

You can check the log of each Lambda function in Amazon CloudWatch; the CDMBuilder and Merge functions each have their own log group. The file that triggered the Lambda is automatically deleted if the conversion was successful.

The ETL tool output also reports the total number of chunks, for example:

...
***** Chunk ids were generated and saved, total count=36 *****
...

and current progress, for example:

*** 6:17 AM | Running: lambda 7 local 0; Validating 0; Valid 16; Timeout 0; Invalid 0; Error 0; Total 23

Export CDM from s3 to Redshift

  1. Create the CDM database DDL
  2. Use the following template to move data from S3 to a Redshift table:
copy "schema_name"."cdm_table" 
from 's3://your_result_bucket/vendor_name/conversionId/cdmCSV/cdm_table/cdm_table.' 
credentials 'aws_access_key_id=your_result_bucket_access_key_id;aws_secret_access_key=your_result_bucket_secret_access_key' DELIMITER ',' CSV QUOTE AS '"' GZIP
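
To load every CDM table this way, the COPY statement can be issued per table over ODBC; a hypothetical helper (the schema, table names, S3 prefix, and credentials are illustrative only):

using System.Data.Odbc;

static void CopyTable(OdbcConnection connection, string schema, string table,
                      string s3Prefix, string credentials)
{
    // same COPY template as above, filled in for one CDM table
    var sql =
        $"copy \"{schema}\".\"{table}\" " +
        $"from '{s3Prefix}/{table}/{table}.' " +
        $"credentials '{credentials}' DELIMITER ',' CSV QUOTE AS '\"' GZIP";

    using var command = new OdbcCommand(sql, connection);
    command.ExecuteNonQuery();
}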

Important

  • Be careful with the trigger bucket: all PUT events in this bucket will invoke the Lambda function
  • In the current implementation, all native tables involved in the transformation must have a distribution key on the person id column (ENROLID for IBM datasets, MEDREC_KEY for Premier, PATID for CPRD, etc.)

etl-lambdabuilder's People

Contributors

adelmestri, angujo, anthonymolinaro, bradanton, burrowse, clairblacketer, dependabot[bot], dimshitc, ericavoss, fdefalco, ganisimov, navirter


etl-lambdabuilder's Issues

Cannot run this app locally after it compiled successfully

I successfully compiled the project and from the

C:\Users\josecz\source\repos\ETL-LambdaBuilder\sources\Presentation\org.ohdsi.cdm.presentation.buildingmanager2\bin\Debug

folder I ran

.\org.ohdsi.cdm.presentation.buildingmanager2.exe

After entering the required db values I started the building process, but I got the following error message:

Builder: HSC-CTSC-ETL
Time: Nov 12 2020 10:49AM
Error: Exception Found:
Type: System.ComponentModel.Win32Exception
Message: The system cannot find the file specified
Source: System
Stacktrace: at System.Diagnostics.Process.StartWithShellExecuteEx(ProcessStartInfo startInfo)
at System.Diagnostics.Process.Start()
at System.Diagnostics.Process.Start(ProcessStartInfo startInfo)
at org.ohdsi.cdm.framework.desktop.Controllers.BuilderController.<>c__DisplayClass27_0.b__0() in C:\Users\josecz\source\repos\ETL-LambdaBuilder\sources\Framework\org.ohdsi.cdm.framework.desktop\Controllers\BuilderController.cs:line 211
at org.ohdsi.cdm.framework.desktop.Controllers.BuilderController.PerformAction(Action act) in C:\Users\josecz\source\repos\ETL-LambdaBuilder\sources\Framework\org.ohdsi.cdm.framework.desktop\Controllers\BuilderController.cs:line 68

Does anyone know how to solve this error?

Thanks

Steps to run out of AWS

I would like (and need) to run this ETL-LambdaBuilder app outside of AWS. Can you please suggest the steps to do so?
Thanks
Jose

Ethnicity in OptumSES should be corrected to be 'unknown'=0 if not Hispanic

Optum SES provides a single 'RACE_CODE' field which combines race and ethnicity information. If this value is White, Black, or Asian, then RACE_CONCEPT_ID should be populated accordingly and ETHNICITY_CONCEPT_ID set to 0. If the RACE_CODE value is 'Hispanic', then RACE_CONCEPT_ID = 0 and ETHNICITY_CONCEPT_ID = 38003562 (as done now). The change: we should not set ethnicity = 'Not Hispanic or Latino' when White/Black/Asian values are provided.
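
A sketch of the proposed mapping (not the current builder logic; the literal RACE_CODE values and the standard race concept ids 8527/8516/8515 are assumptions for illustration):

static (int raceConceptId, int ethnicityConceptId) MapRaceCode(string raceCode) =>
    raceCode switch
    {
        "White"    => (8527, 0),        // race populated, ethnicity unknown
        "Black"    => (8516, 0),
        "Asian"    => (8515, 0),
        "Hispanic" => (0, 38003562),    // ethnicity id cited in this issue
        _          => (0, 0)
    };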

Invalid column name 'payersource'...

Hello, I am trying to upgrade to this from the old ETL-CDMBuilder since I need v5.3. I am able to run the old software on my current data. However, when I try to run this software I receive an error during the 'conversion to cdm' step. It has type 'System.Data.SqlClient.SqlException' and messages such as 'Invalid column name 'payersource'', 'Invalid column name 'observation_concept_id'', etc. I searched through all the files in the software and found that this error is likely coming from one of two files: either ./executable/Core/Transformation/Truven/Definitons/ENROLLMENT_DETAIL.xml or ./sources/Framework/org.ohdsi.cdm.framework.common2/Core/Transformation/Truven/Definitions/ENROLLMENT_DETAIL.xml. These files contain queries that seem to imply that my enrollment detail table should have payersource, observation_concept_id, etc. columns, which it does not have. I have checked the source sas7bdat Truven files and these columns are not present in the enrollment detail file either. Am I missing a pre-processing step? Any ideas about how to fix this issue? Let me know if you need any more details!

Lookup and Lookuptype tables

While creating the test and additional tables, the CPRD_Test_Setup.sql and CPRD_Additional_Setup.sql queries are being run; they reference lookup and lookuptype tables on which joins are happening.

How are these tables being created, and how are they getting populated?

eGFR (Estimated glomerular filtration rate) mapped to Biomarker EGFR

Hello,

It looks like the lab test for eGFR (note the capitalisation - expanded to Estimated glomerular filtration rate) is mapped to Epidermal Growth Factor Receptor (EGFR; concept_id 3040084) in the lab names file (https://github.com/OHDSI/ETL-LambdaBuilder/blob/master/docs/OPTUM_PANTHER/Vocab%20Updates/JNJ_OPTUM_EHR_LABNAM.txt) as well as the NLPM file (https://github.com/OHDSI/ETL-LambdaBuilder/blob/master/docs/OPTUM_PANTHER/Vocab%20Updates/JNJ_OPTUM_EHR_NLPM.txt). This presents a problem of overwriting, and all eGFR tests need a separate one.

There is also a problem that multiple different Estimated glomerular filtration rates exist (MDRD, Mayo, African-American, non-African-American), and while the non-African-American one may be mapped correctly (concept: 3049187) in the NLPM file, all the other ones are mapped to the wrong concept as above.
Additionally, they also need to be correctly moved to the Labs section.
We can hopefully resolve these concept_ids.

Regards
Aditya

CPRD STEM Logic

While creating the STEM table from clinical, referral, immunisation, test, additional, and therapy, the logic for a few columns present in test, additional, and therapy is not available for the clinical, referral, and immunisation stem tables (for example: operator_concept_id, range_high, range_low, unit_concept_id, unit_source_value, start_date, end_date, value_as_number, value_as_string, value_as_concept_id, value_source_value).

Is there any join logic for unifying all the tables into a single STEM table?
