ohdsi / etl-cms Goto Github PK

View Code? Open in Web Editor NEW

94.0 57.0 52.0 23.21 MB

Workproducts to ETL CMS datasets into OMOP Common Data Model

License: Apache License 2.0

Python 99.60% Ruby 0.40%

etl-cms's People

Stargazers

Watchers

Forkers

claire-oi aguynamedryan chendaniely donohara pnorton qiaofengxu sirpoovey nitinaggarwal1986 linearregression baberryyoung njgraham pdaicode cukarthik fongcj saadjanjua indera dkoher1 mkrivitskiy guiguiw mustafaascha stillbrave vincent1756 kandalva yangcheny xpf100 joaorafaelalmeida ningyifan joshransom kinyusui ablack3 nagu4dwh linewalks golepandey rohankhera jediwarpraptor sherylgit lenamax2355 cphi-tvhs aaalgo alanyappie sushant097 qualitymeasurement farazahma thecedarprince entityrisk byeungchun tsgallen vdeeplearn bjyen ahmedyarub dhrayco

etl-cms's Issues

mapping for CONDITION_TYPE_CONCEPT_ID not complete

A CMS Inpatient_Claims data record has a ADMTNG_ICD9_DGNS_CD field and also ICD9_DGNS_CD_1 – ICD9_DGNS_CD_10, ICD9_PRCDR_CD_1 – ICD9_PRCDR_CD_6.

A concept in the vocabulary exists for Primary Diagnosic: 38000199òInpatient header - primaryò1òCondition Occurrence Type but it is not used in the ETL. All inpatient records are hardcoded to a CONDITION_TYPE_CONCEPT_ID of 38000200 which is Inpatient header - 1st position and similar 38000230 for outpatient. This happens also in the procedure_occurrence OMOP table.

I believe it means that data is lost in the translation since there are 17 possible codes coming from the Medicare data but the data is mapped into only 4.

In my case, I am trying to replicate a study called Surgeon Scorecard against an OMOP data format that used the Medicare CMS ADMTNG_ICD9_DGNS_CD field as a way to find particular surgeries and calculate a Surgeon complication rate.

Here are the counts of the concept_id's in the converted SynPuf data.

CONDITION_TYPE_CONCEPT_ID,COUNT,DESCRIPTION
38000230,280864910,Outpatient header - 1st position
38000200,8317475,Inpatient header - 1st position

PROCEDURE_TYPE_CONCEPT_ID,COUNT,DESCRIPTION
38000269,275176949,Outpatient header - 1st position
38000251,3592580,Inpatient header - 1st position

PEP8 Python Conventions

Suggest to slowly fix things to follow pep8 conventions as we slowly go spelunking.
We can worry about variable names being in all caps until we have the unit tests from #12 all created.

ETL Visit Occurrence

visit_occurrences aren't yet generated by the ETL.

This is the most complicated table to ETL. Employees of @outcomesinsights (@claire-oi, @marc-outins, @jenniferduryea) are standing by to answer questions about this part of the ETL should you have any.

srt-files not found

I run script from unm-improvements branch and found error

Script requires ".csv.srt" files. What is it? I disabled it in this row.

location_id is null for all care_site records

Hello,

I download the synpuf dataset and noticed that there is something off with the care_site table records so I'd like to verify ( ftp://ftp.ohdsi.org/synpuf/care_site.csv.gz).

Is it expected that location_id is null for all care_site records?

Is there a way to associate each of the care_site records to specific location_id(s)?

Thanks,
Laurence

Incorrect URL to one source Synpuf file

On the CMS website, the 2010 Beneficiary Summary file of sample 1 has an incorrect name. It is called DE1_0_2010_Beneficiary_Summary_File_Sample_20.zip, but the Python script expects it to be called DE1_0_2010_Beneficiary_Summary_File_Sample_1.zip (which would make more sense since it is sample number 1, not 20).

Because of this, the Python script throws an error when trying to download the first sample.

A simple workaround is:

Use the script to download the first sample, and watch it throw the error at the last file.
Manually download and unzip the missing file.
Use the script to download the other 19 samples without issue.

It would be nice if the script was modified to deal with this, but I suspect not many people will be working on this repo anymore.

Column mismatch in create_CDMv5_tables.sql and load_CDMv5_vocabulary.sql

File : https://github.com/OHDSI/ETL-CMS/blob/master/SQL/create_CDMv5_tables.sql
CREATE TABLE IF NOT EXISTS synpuf5.drug_strength (
drug_concept_id INTEGER NOT NULL,
ingredient_concept_id INTEGER NOT NULL,
amount_value NUMERIC NULL,
amount_unit_concept_id INTEGER NULL,
numerator_value NUMERIC NULL,
numerator_unit_concept_id INTEGER NULL,
denominator_value NUMERIC NULL,
denominator_unit_concept_id INTEGER NULL,
valid_start_date DATE NOT NULL,
valid_end_date DATE NOT NULL,
invalid_reason VARCHAR(1) NULL
)
;
File : https://github.com/OHDSI/ETL-CMS/blob/master/SQL/load_CDMv5_vocabulary.sql
The CSV we get from http://www.ohdsi.org/web/athena/ contains additional column as 'BOX_SIZE' in drug_strength.csv

Error on Step 5 of Python ETL process

When I run python CMS_SynPuf_ETL_CDM_v5.py 0 on step 5 "Test ETL with DE_0 CMS test data" I get the following traceback

  File "CMS_SynPuf_ETL_CDM_v5.py", line 7, in <module>
    from constants import OMOP_CONSTANTS, OMOP_MAPPING_RECORD, BENEFICIARY_SUMMARY_RECORD, OMOP_CONCEPT_RECORD, OMOP_CONCEPT_RELATIONSHIP_RECORD
ImportError: cannot import name OMOP_CONSTANTS

Is this a problem with the env file?

Exception while running Step 3 java -jar cpt4.jar 5

After downloading the vocabulary files, I am running the command java -jar cpt4.jar 5 to append the CPT vocabulary. But I keep getting the following exception:

javax.xml.ws.WebServiceException: Failed to access the WSDL at: https://uts-ws.nlm.nih.gov/services/nwsSecurity?wsdl. It failed with:
    java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty.
    at com.sun.xml.ws.wsdl.parser.RuntimeWSDLParser.tryWithMex(RuntimeWSDLParser.java:251)
    at com.sun.xml.ws.wsdl.parser.RuntimeWSDLParser.parse(RuntimeWSDLParser.java:228)
    at com.sun.xml.ws.wsdl.parser.RuntimeWSDLParser.parse(RuntimeWSDLParser.java:191)
    at com.sun.xml.ws.wsdl.parser.RuntimeWSDLParser.parse(RuntimeWSDLParser.java:160)
    at com.sun.xml.ws.client.WSServiceDelegate.parseWSDL(WSServiceDelegate.java:307)
    at com.sun.xml.ws.client.WSServiceDelegate.<init>(WSServiceDelegate.java:269)
    at com.sun.xml.ws.client.WSServiceDelegate.<init>(WSServiceDelegate.java:205)
    at com.sun.xml.ws.client.WSServiceDelegate.<init>(WSServiceDelegate.java:195)
    at com.sun.xml.ws.spi.ProviderImpl.createServiceDelegate(ProviderImpl.java:112)
    at javax.xml.ws.Service.<init>(Service.java:77)
...

Also, tried to register and get licensed on UMLS (by following this). But even I provided the username and password, I still have the same issue.

Any idea of what's going on? How to solve the issue?

CMS RIF to OMOP ETL?

This doesn't support RIF, does it? Is there a repository nearby that does? I don't see one.

context: I'm working on CMS RIF to i2b2 (https://github.com/kumc-bmi/grouse) and considering going via OMOP.

cc @pbr6cornell @njgraham @rwaitman @mprittie

.env.example files missing required entry for BASE_ETL_CONTROL_DIRECTORY

The .env.example files are missing the required entry for BASE_ETL_CONTROL_DIRECTORY. This causes the etl to fail until it has been added

Create unit tests

The original author suggested that we should add some unit tests to the ETL. Assuming that we don't find #11 to be an efficient/sufficient means of testing the ETL, I agree we should have some unit tests.

Carrier_Claims not found

I run script from unm-improvements branch and found error

Script looks for "DE1_0_2008_to_2010_Carrier_Claims_Sample_4.csv" file. ( not for DE1_0_2008_to_2010_Carrier_Claims_Sample_4A.csv or DE1_0_2008_to_2010_Carrier_Claims_Sample_4B.csv) and didn't found it.

I think it here

CMS ACO CCLF --> OMOP V5

Version_18_CCLF_IP_Eff_April_2017.pdf

ACO CCLF Information Packet Jan 2017, effective April 2017 - to support ETL work to OMOP v5

'issue' posted as new topic in the forum with reference to this git issue

Observation Periods are broken

The ETL is generating observation periods, but according to the original author, they seem to be incorrect. I have no further details.

Updating to newer CDM versions + upload R script available

I wrote an R script for upload the Synpuf data into CDM v5.2.2. There are some new conventions for CDMs that the ETL script doesn't yet follow. My script fixes those, but it would be nicer if they were moved into the ETL script. These are the issues I found:

Per-domain cost tables should be merged in to a single cost table. My script currently just ignores the cost tables
All DATE fields should also get a DATE_TIME field, by just copying the date, and setting time to midnight
The DRUG_EXPOSURE_END_DATE field must be populated. This can simply be done by adding the DAYS_SUPPLY to the DRUG_EXPOSURE_START_DATE.

I guess 1 is harder, but 2 and 3 are pretty trivial.

For even newer versions I think we need to start using the new TYPE_CONCEPT_ID and STATUS_CONCEPT_ID codes and fields.

omop_vocab_xref_0723.txt

Hello!
Where I can find omop_vocab_xref_0723.txt?. Here is a refenrence

some case sensitive command issue in python-etl README file

The command CMS_SynPuf_ETL_CDM_v5.py was CMS_SynPUF_ETL_CDM_v5.py in the python-etl README file which caused some confusion when trying to cut and paste commands from it.

Drug-cost raw data has an embedded newline in the TOT_RX_CST_AMT field

According to the original author, this is an issue that needs to be addressed in the ETL. I have no further details.

SynPuf conversion using icd10 codes

In 2015, Medicare switched over to using icd10 coding though a new SynPUF has not been provided. Is there value in having a data set that uses icd10 codes instead of icd9?

With some code I have written, I could pretty easily convert the omop Synpuf data from icd9 to icd10 and then shift the dates. icd9 to icd10 can be one to many so the code could select a random icd10 code from the group.

Would there be interest in this?

I can see the need for 3 datasets:
-icd9 code dataset for pre 2015 data (direct convertion of SynPUF)
-icd10 code dataset for post 2015 data
-data the includes mixed icd9 and icd10 data based on the years the Medicare switched.

Goal is to allow researchers to test their code prior to getting the real data.

Use idiomatic logging for Python

The original author used print statements to log the progress of the ETL and is hoping that someone else can replace that approach with whatever Python uses as a standard for generating logs.

Python dependency

Inside CMS VRDC, we are not able to run Python. We initiated a project that ports this ETL to pure SQL that can run in SAS (SAS SQL flavor :-( )

ETL Measurements

measurements aren't yet generated by the ETL.

Downloading final data

Hi,

I tried to download the processed data from your FTP link and the Google drive but neither of them are working. Is there any other option to get the data ?

Thanks.

Paul L

CDM_SOURCE.VOCABULARY_VERSION

I think I see a copy of the SynPUF out on Postgres Public. If this is truly a build then the CDM_SOURCE.VOCABULARY_VERSION would be more informative if you pulled this information off the Vocab you ran off of:

SELECT VOCABULARY_VERSION 
FROM vocabulary 
WHERE VOCABULARY_ID = 'None';

This way you know exactly which Vocab was used in the build.

Issue while running python code to get the claims data

Hi All,

I am trying to run the Python code to get the claims data directly from the website that is available on the site i.e get_synpuf_files.py. but I am getting the syntax error. could you please let me know the correct steps to run this successfully?

Thanks,
Ratan

FTP server down?

README.md says "The processed data can be retrieved from ftp://ftp.ohdsi.org/synpuf"

It appears ftp://ftp.ohdsi.org/synpuf is inaccessible; Is the FTP server down?

Person ID as integer, CMS's synthetic data has string IDs

In CMS synthetic data, the Beneficiary ID, 'DESYNPUF_ID', is of type String (technically hexadecimal).

When using Spark or other big data engines, it is generally easier to use this existing unique ID than to compute our own incremental ID, as that would require computing all IDs on a single node in the cluster (each id needs the value of the previous one). That would not allow parallelization of the processing.

Is there any particular reason for making IDs integer, instead of string?

Thanks!

Memory issue of steps 5 in ETL Script

Hi All,

I am executing the step 5 python ETL script and getting the below error.

when I tested the script with test data, it executed successfully but when I am running this on cSynpuf claims data I am having this issue.

I am running this on Intel i7-6600U CPU 2.81 GHz processor and 12 GB RAM. additionally, i have more than 150 GB space free on the drive.

Please let me know if someone has come across this issue while their implementation process and steps to resolve this.

Thanks in advance for your help.

Thanks,
Ratan

CDMv5.2 is not accessible

I am trying to download the downloadable OMOP CDMv5.2 version; however, the file seems not available to the public. Please modify the permissions to allow for public access

https://drive.google.com/file/d/1xWmuVqlIaUsY08OgrKIt8WAsfaq_iCrG/view?usp=sharing

md5 sums don't match files - https://github.com/OHDSI/ETL-CMS/blob/master/python_etl/README.md

I'm working on setting up a local copy of the OMOP SynPUF for testing and noticed that the MD5s show on the readme at https://github.com/OHDSI/ETL-CMS/blob/master/python_etl/README.md don't match the md5 of the files. This file on the ftp site matches what I get locally:

ftp://ftp.ohdsi.org/synpuf/1.0.1/md5sums.txt

The specific mismatches are (shown with what appears to be the correct md5 hash):

da0e310e7313e7b4894d1b1a15aee674 synpuf_1.zip
c1529ea8b4e4e092d0cdd2600ea61c75 visit_occurrence.csv.gz

The readme appears to have the old md5sums from ftp://ftp.ohdsi.org/synpuf/1.0.0/md5sums.txt for those two files.

Letters in condition_source_concept_id

I run script from unm-improvements branch and found error

Condition_occurrence table has int condition_source_concept_id field. But script generates string value in this position (value has letters) of result csv-file. Source csv-file has no letters in condition_source_concept_id position

Make ETL run under Python 3.X

The original ETL was written with a few lines of code that only run under Python 2.x (mostly print statements that should be eliminated in #8).

In my mind, the ETL should be running under Python 3.x, seeing as 3.x has gained relatively widespread adoption at this point and I don't believe there are any library-dependencies that require us to stick with 2.x

Test the output of the ETL

We should gather a handful of rows from each source file and hand-convert those rows into CDMv5, then feed those rows through our ETL and verify it generates the exact same rows. This will serve to test that our ETL process is properly functioning.

As we hit odd rows in the raw data, we can add them to our set of test rows and verify that we're covering all the weird edge-cases we discover during the implementation of the ETL.

My experiences using the unm-improvements branch - Apr 14

Hey folks,

First off, thank you for this resource connecting the synthetic dataset to OMOP, this is very helpful for me to evaluate how OMOP can benefit my work. This has saved me a ton of time.

Below is some unsolicited feedback after using the unm-improvements branch to generate sample patient data for my local OMOP CDM instance. I was referred to here from this discussion.

The get_synpuf_files.py utility was confusing. The README was correct, so doing python output 4 20 worked, but the feedback from the tool was telling me otherwise: output was the INPUT_DIRECTORY, 4 was the OUTPUT_DIRECTORY, and 20 was the SAMPLE_RANGE.
It looks like the get_synpuf_files.py is written in python3, but CMS_SynPuf_ETL_CDM_v5.py is python2. It seems like you folks are thinking about which to use, but to me, consistency within a single repo is the most important trait.
I wasn't sure where to find the omop_vocab_xref_0723.txt, so I ended up commenting out the section that builds the mapping xref.
The concepts need to be loaded into the OMOP CDM v5 database before the CSVs can be loaded.

I'm sure there's lots of internal discussion over on your end, but I would suggest possibly the following to make this really useful to the general public:

Try to make a single script that encapsulates the steps of the README. This process is pretty close to being done!
Release the results of running the ETL-CMS in a zip file on a publicly-accessible place. I could imagine publishing this on a regular basis, as the process and quality improves. It makes it much easier to have a conversation about quality, too.

Thanks again! Super helpful.

KeyError: 'Place of Service'

I downloaded the SNOMED, ICD9CM, ICD9Proc, CPT4, HCPCS, LOINC, RxNorm, and NDC CDMv5 Vocabularies from Athena following the instructions in /python_etl/README.md (step 3). When I run a test of the ETL process using python CMS_SynPuf_ETL_CDM_v5.py 0, I get the following error after awhile:

Done, omop concept recs_in            = 35135870
recs_skipped                          = 31294019
len source_code_concept_dict           = 0
Reading omop_concept_file        -> /Volumes/hadrianus/Data/OHDSI/OMOP/02_Vocabulary/Standard_Vocabulary_v5/CONCEPT.csv
Writing to log file              -> /Volumes/hadrianus/Data/Evidera/98_output/concept_debug_log.txt
loaded domain dict with this many records:  15639
Traceback (most recent call last):
  File "CMS_SynPuf_ETL_CDM_v5.py", line 2028, in <module>
    build_maps()
  File "CMS_SynPuf_ETL_CDM_v5.py", line 464, in build_maps
    destination_file = domain_destination_file_list[domain_id]
KeyError: 'Place of Service'

Below is the referenced concept_debug_log.txt
concept_debug_log.txt

Update ETL logic to support CDM 6.X

CDM 6.x has introduced some key changes to the omop CDM, that will require updating the ETL logic. New changes includes changes to cost table, payer_plan_period, visit_detail, location, location_history.

It would be nice to work together to create a new version of the ETL logic

columns mismatched in location table

In the location csv file, the data for the zip_code is in the county column position. The zip code column position is null.

for example,

location_id,address_1,address_2,city,state,zip,county,location_source_value
1,,,,MO,,269500,26-95
2,,,,PA,,39230,39-230

ETL Locations

locations aren't yet generated by the ETL

Synpuf Load for SQL Server

Here's a script to load the synpuf data into SQL Server if you'd like to add it to the project.

TRUNCATE TABLE [care_site];
BULK INSERT [care_site]
FROM 'C:\synpuf_1\care_site_1.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\synpuf_1\care_site_1.bad',
TABLOCK
);

TRUNCATE TABLE [condition_occurrence];
BULK INSERT [condition_occurrence]
FROM 'C:\synpuf_1\condition_occurrence_1.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\synpuf_1\condition_occurrence_1.bad',
TABLOCK
);

TRUNCATE TABLE [death];
BULK INSERT [death]
FROM 'C:\synpuf_1\death_1.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\synpuf_1\death_1.bad',
TABLOCK
);

TRUNCATE TABLE [device_exposure];
BULK INSERT [device_exposure]
FROM 'C:\synpuf_1\device_exposure_1.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\synpuf_1\device_exposure_1.bad',
TABLOCK
);

TRUNCATE TABLE [drug_exposure];
BULK INSERT [drug_exposure]
FROM 'C:\synpuf_1\drug_exposure_1.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\synpuf_1\drug_exposure_1.bad',
TABLOCK
);

TRUNCATE TABLE [location];
BULK INSERT [location]
FROM 'C:\synpuf_1\location_1.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\synpuf_1\location_1.bad',
TABLOCK
);

TRUNCATE TABLE [measurement];
BULK INSERT [measurement]
FROM 'C:\synpuf_1\measurement_occurrence_1.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\synpuf_1\measurement_occurrence_1.bad',
TABLOCK
);

TRUNCATE TABLE [observation];
BULK INSERT [observation]
FROM 'C:\synpuf_1\observation_1.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\synpuf_1\observation_1.bad',
TABLOCK
);

TRUNCATE TABLE [observation_period];
BULK INSERT [observation_period]
FROM 'C:\synpuf_1\observation_period_1.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\synpuf_1\observation_period_1.bad',
TABLOCK
);

TRUNCATE TABLE [payer_plan_period];
BULK INSERT [payer_plan_period]
FROM 'C:\synpuf_1\payer_plan_period_1.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\synpuf_1\payer_plan_period_1.bad',
TABLOCK
);

TRUNCATE TABLE [person];
BULK INSERT [person]
FROM 'C:\synpuf_1\person_1.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\synpuf_1\person_1.bad',
TABLOCK
);

TRUNCATE TABLE [procedure_occurrence];
BULK INSERT [procedure_occurrence]
FROM 'C:\synpuf_1\procedure_occurrence_1.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\synpuf_1\procedure_occurrence_1.bad',
TABLOCK
);

TRUNCATE TABLE [provider];
BULK INSERT [provider]
FROM 'C:\synpuf_1\provider_1.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\synpuf_1\provider_1.bad',
TABLOCK
);

TRUNCATE TABLE [specimen];
BULK INSERT [specimen]
FROM 'C:\synpuf_1\specimen_1.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\synpuf_1\specimen_1.bad',
TABLOCK
);

TRUNCATE TABLE [visit_occurrence];
BULK INSERT [visit_occurrence]
FROM 'C:\synpuf_1\visit_occurrence_1.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\synpuf_1\visit_occurrence_1.bad',
TABLOCK
);

ETL Devices

devices aren't yet generated by the ETL.

ETL Observations

observations aren't yet generated by the ETL.

Condition/Procedure Occurrence Type Concept IDs are wrong

According to the original author, the condition/procedure_type_concept_ids are not correctly generated by the ETL.

CMS_SynPuf_ETL_CDM_v5.py uses in the inpatient visit type for all visit occurrences

The Inpatient visit type is applied to all visit occurrences (9201). Dropping this here as a reminder to myself or anyone else to make the quick fix. Should also update the pre-compiled data files based on the fix.

ETL-CMS/python_etl/CMS_SynPuf_ETL_CDM_v5.py

Line 1436 in 35d0c30

visit_concept_id=OMOP_CONSTANTS.INPAT_VISIT_CONCEPT_ID,

drug_era, observation_period, cost column mismatch

Many thanks for providing the 100k download vie Google drive. However, I experienced some problems when importing the data.

drug_era

Importing drug_era fails: ERROR: invalid input syntax for integer: "2008-02-27"

Looking at the data, the columns don't match with the table definition in CDM (I checked also versions before and after 5.2): The third column should be the concept_id, not a date.

gunzip -c drug_era.gz | head
4107,528323,2008-02-27,2008-02-27,1,\000,215049
2060,722031,2008-03-11,2008-03-21,1,\000,105476
4938,710062,2008-02-14,2008-03-15,1,\000,258700

CREATE TABLE drug_era
(
drug_era_id INTEGER NOT NULL ,
person_id INTEGER NOT NULL ,
drug_concept_id INTEGER NOT NULL ,
drug_era_start_date DATE NOT NULL ,
drug_era_end_date DATE NOT NULL ,
drug_exposure_count INTEGER NULL ,
gap_days INTEGER NULL
)
;

observation_period

Similarly, the data for observation_period still has the observation_period_start_datetime and observation_period_stop_datetime columns, which were deprecated before 5.2.

cost

Last, the cost data does not fit the column definition by the CDM. The import fails, because "Procedure" is not an INTEGER.

gunzip -c cost.gz | head
189455742,Procedure,0,44818668,\000,\000,10,10,0,\000,0,0,0,\000,\000,\000,\000,\000,\000,0,\000,543966571
189455742,Procedure,0,44818668,\000,\000,20,20,0,\000,0,0,0,\000,\000,\000,\000,\000,\000,0,\000,543966572
189455743,Procedure,0,44818668,\000,\000,10,10,0,\000,0,0,0,\000,\000,\000,\000,\000,\000,0,\000,543966573

CREATE TABLE cost
(
cost_id INTEGER NOT NULL ,
cost_event_id INTEGER NOT NULL ,
cost_domain_id VARCHAR(20) NOT NULL ,
cost_type_concept_id INTEGER NOT NULL ,
currency_concept_id INTEGER NULL ,
total_charge NUMERIC NULL ,
total_cost NUMERIC NULL ,
total_paid NUMERIC NULL ,
paid_by_payer NUMERIC NULL ,
paid_by_patient NUMERIC NULL ,
paid_patient_copay NUMERIC NULL ,
paid_patient_coinsurance NUMERIC NULL ,
paid_patient_deductible NUMERIC NULL ,
paid_by_primary NUMERIC NULL ,
paid_ingredient_cost NUMERIC NULL ,
paid_dispensing_fee NUMERIC NULL ,
payer_plan_period_id INTEGER NULL ,
amount_allowed NUMERIC NULL ,
revenue_code_concept_id INTEGER NULL ,
reveue_code_source_value VARCHAR(50) NULL ,
drg_concept_id INTEGER NULL,
drg_source_value VARCHAR(3) NULL
)
;

Error in loading CDMv5 vocabularies into postgres

Hi,
I followed the steps in documentations to load data in postgres. but in part e of step 7, when I wanna load data using the sql file, I see the following error:

psql:load_CDMv5_vocabulary.sql:26: ERROR: extra data after last expected column
CONTEXT: COPY drug_strength, line 2: "42478670 913782 2 8576 2
0150817 20991231

I got vocabularies from OMOP Vocabulary Web Site and did not change it. I couldn't find the problem. can any one tell me what I am doing wrong.