
data-analytics-using-electronic-health-records's Introduction

Data Analytics Pipeline to Predict Outcomes using Structured Medical Data (Electronic Health Records)


About the repository

This repository contains a generic EHR (Electronic Health Records) preprocessing pipeline completed at the Clinical AI Lab, New York University Abu Dhabi.

Authors

Ghadeer Ghosheh: Research Assistant, Clinical AI Lab

Farah Shamout: Assistant Professor Emerging Scholar, Clinical AI Lab

Requirements

Clone or download the project and, if you downloaded an archive, unzip it to your preferred folder. Then run the following command from the project folder to install the requirements.

pip install -r requirements.txt

Repository Content

Preprocessing highlights

Dataset

For demonstration purposes, we generated a dummy dataset that contains typical EHR variables. The dummy dataset is composed of 1000 patient instances. It includes typical demographic variables, vital signs, and lab results, along with a variable named "Hospital death" that reports the final clinical outcome of the patient's hospital stay.

Exploratory Data Analysis

Before applying any machine learning model, it is essential to understand the characteristics of the data at hand. Exploratory Data Analysis (EDA) is an initial investigation of the data that aims to discover patterns, detect anomalies, and better understand the variables and their distributions.

In this folder, we present a notebook that investigates the dummy dataset used throughout the repository. The notebook introduces some functions for visualizing the data as well as some libraries that accelerate the analysis. The folder also includes a profiling report, "report_dummy.html", that summarizes the quantile and descriptive statistics, correlations, missing values, and outliers present in the dataset.
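As a rough illustration of this kind of first pass, the sketch below computes descriptive statistics, missing-value fractions, outcome counts, and correlations with pandas. It is not the repository's notebook: the data is generated on the spot, and the column names (`age`, `heart_rate`, `gender`, `hospital_death`) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dummy dataset; column names are assumptions.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "heart_rate": rng.normal(80, 15, size=n),
    "gender": rng.choice(["F", "M"], size=n),
    "hospital_death": rng.choice([0, 1], size=n, p=[0.9, 0.1]),
})
# Blank out 50 heart-rate values to mimic missing entries in real EHR data.
df.loc[rng.choice(n, size=50, replace=False), "heart_rate"] = np.nan

summary = df.describe()                               # descriptive stats, numeric columns
missing_fraction = df.isna().mean()                   # share of missing values per column
outcome_counts = df["hospital_death"].value_counts()  # class balance of the outcome
correlations = df.select_dtypes("number").corr()      # pairwise Pearson correlations

print(summary)
print(missing_fraction)
print(outcome_counts)
print(correlations)
```

Checking the class balance of the outcome early is useful because hospital mortality is typically a minority class, which affects how models should later be split and evaluated.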

Data preprocessing

This script includes functions that can be used to preprocess Electronic Health Records stored in CSV format. The functions handle typical issues in EHR data such as missing values, implausible and invalid values, and categorical encoding. The script also includes several functions that demonstrate feature engineering techniques that are particularly useful in clinical data analysis.
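Two of the issues named above, missing values and categorical encoding, can be sketched as follows. The helper names `impute_median` and `one_hot_encode` are hypothetical and are not the functions defined in the repository's script; median imputation is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

def impute_median(df, columns):
    """Return a copy of df with NaNs in the given columns replaced by the column median."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median())
    return out

def one_hot_encode(df, columns):
    """Return a copy of df with the given categorical columns expanded to indicator columns."""
    return pd.get_dummies(df, columns=columns)

# Tiny illustrative frame; column names are assumptions.
raw = pd.DataFrame({
    "heart_rate": [72.0, np.nan, 88.0, 95.0],
    "gender": ["F", "M", "M", "F"],
})

clean = one_hot_encode(impute_median(raw, ["heart_rate"]), ["gender"])
print(clean)
```

Both helpers return new frames instead of modifying their input, so the raw data stays available for comparison.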

After the functions are called, the resulting datasets are stored in a pickle file named "prepared.pkl".
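A minimal sketch of saving and reloading a prepared DataFrame with pandas is shown below. Only the file name "prepared.pkl" comes from the text above; the frame contents and the temporary directory are assumptions, and the repository's script may store a different object in the file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical prepared data; the real file may hold several datasets.
prepared = pd.DataFrame({
    "heart_rate": [72.0, 88.0, 95.0],
    "hospital_death": [0, 0, 1],
})

# Write to a temporary folder so the example does not touch the project directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prepared.pkl")
    prepared.to_pickle(path)          # serialize the DataFrame
    restored = pd.read_pickle(path)   # load it back

print(restored.equals(prepared))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.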

Each function in the script is documented with its purpose and usage.

Model

After preprocessing, datasets are usually split into training and testing sets. The training set is used to train the machine learning models, while the test set is used to evaluate their predictions on unseen data. In this project we present a sample notebook that introduces commonly used libraries for building machine learning models, applied to the dummy dataset with an 8:2 split ratio for the training and testing sets.

The models included in the sample notebook are Logistic Regression, Multi-layer Perceptron Regressor, Support Vector Machine, Gradient Boosting Regressor, and Ridge Regressor.
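The 8:2 split and one of the listed models, Logistic Regression, can be sketched with scikit-learn as follows. This is an illustration under assumptions and not the repository's notebook: the features and outcome are synthetic, and the use of stratification, `random_state=42`, and AUROC as the metric are choices made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and the binary outcome.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# 8:2 split; stratify keeps the outcome ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Score the held-out set using the predicted probability of the positive class.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(X_train.shape, X_test.shape, round(auc, 3))
```

The other estimators can be swapped in at the `model = ...` line; those without `predict_proba` would need their raw predictions passed to the metric instead.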

Plausible Data Dictionary

An important aspect of clinical data preprocessing is making sure that the data used to train the machine learning model is clinically valid and free of implausible values.

  • "Plausible_EHR.csv" is a plausible data dictionary compiled based on the ANZICS CORE Adult Patient Database for ICU patients.

This dictionary includes 28 numerical clinical features such as lab results, vital signs, and blood gas measurements. It is a helpful tool for detecting implausible and invalid entries so that they can be imputed.

  • Note: Please make sure that the unit in the "Unit of measure" column matches the one used in your dataset so that implausible values are detected correctly.
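One way such a dictionary might be applied is sketched below: values outside a feature's plausible range are replaced with NaN so they can be imputed later. Only the "Unit of measure" column name comes from the text above. The `Feature`, `Min`, and `Max` column names, the helper `mask_implausible`, and the numeric bounds are assumptions for illustration and are not taken from "Plausible_EHR.csv" or from ANZICS.

```python
import pandas as pd

# Hypothetical dictionary rows; bounds are illustrative, not clinical guidance.
plausible = pd.DataFrame({
    "Feature": ["heart_rate", "temperature"],
    "Unit of measure": ["beats/min", "degrees C"],
    "Min": [0.0, 25.0],
    "Max": [300.0, 46.0],
})

def mask_implausible(df, dictionary):
    """Return a copy of df where values outside [Min, Max] are set to NaN."""
    out = df.copy()
    for _, row in dictionary.iterrows():
        feature = row["Feature"]
        if feature not in out.columns:
            continue  # dictionary entry has no matching column in the data
        in_range = out[feature].between(row["Min"], row["Max"])
        out[feature] = out[feature].where(in_range)
    return out

records = pd.DataFrame({
    "heart_rate": [72.0, -5.0, 410.0, 88.0],
    "temperature": [36.6, 98.6, 37.2, 12.0],  # 98.6 looks like a Fahrenheit entry
})

cleaned = mask_implausible(records, plausible)
print(cleaned)
```

The 98.6 entry shows why units matter: it is a normal temperature in Fahrenheit but gets flagged as implausible against a Celsius range.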

Citation

For this work, please cite: DOI
