Python Library for Healthcare AI (PyHealth)


Development Status: As of 08/02/2020, PyHealth is under active development and in its alpha stage. Please follow, star, and fork to get the latest functions!

PyHealth is a comprehensive and flexible Python library for healthcare AI, designed for both ML researchers and medical practitioners. The library is proudly developed and maintained by researchers at Carnegie Mellon University, IQVIA, and University of Illinois at Urbana-Champaign. PyHealth makes many important healthcare tasks accessible, such as phenotyping prediction, mortality prediction, and ICU length-of-stay forecasting. Running these prediction tasks with deep learning models can take as few as 10 lines of code.

PyHealth comes with three major modules: (i) a data preprocessing module; (ii) a learning module; and (iii) an evaluation module. Typically, one runs the data preprocessing module to prepare the data, feeds the result to the learning module for prediction, and finally assesses the predictions with the evaluation module. Users can use the full system as described or only selected modules, depending on their needs:

  • Deep learning researchers may directly use the processed data along with the proposed new models.
  • Medical personnel may leverage our data preprocessing module to convert medical data into a format that learning models can digest, and then perform inference tasks to get insights from the data.
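
The three-stage flow described above can be pictured as three interchangeable callables. The following is a minimal sketch, not PyHealth's actual implementation: `preprocess`, `MeanPredictor`, and `evaluate` are hypothetical stand-ins written only to illustrate how the stages hand data to each other.

```python
# Hypothetical three-stage pipeline: preprocess -> learn -> evaluate.
# None of these names come from PyHealth; they only mirror the module layout.

def preprocess(raw_records):
    """Turn raw (value, label) string records into feature and label lists."""
    X = [float(value) for value, _ in raw_records]
    y = [int(label) for _, label in raw_records]
    return X, y


class MeanPredictor:
    """Toy learner: predicts 1 when a value exceeds the training mean."""

    def fit(self, X, y):
        self.threshold_ = sum(X) / len(X)
        return self

    def inference(self, X):
        return [1 if x > self.threshold_ else 0 for x in X]


def evaluate(y_hat, y):
    """Accuracy: fraction of predictions that match the labels."""
    return sum(int(a == b) for a, b in zip(y_hat, y)) / len(y)


raw = [("1.0", "0"), ("2.0", "0"), ("8.0", "1"), ("9.0", "1")]
X, y = preprocess(raw)
model = MeanPredictor().fit(X, y)
accuracy = evaluate(model.inference(X), y)
print(accuracy)
```

Because each stage only depends on the output of the previous one, any stage can be swapped for a custom function without touching the others.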

PyHealth is featured for:

  • Unified APIs, detailed documentation, and interactive examples across various datasets and algorithms.
  • Advanced models, including latest deep learning models and classical machine learning models.
  • Optimized performance with JIT and parallelization when possible, using numba and joblib.
  • Customizable modules and flexible design: each module may be turned on/off or totally replaced by custom functions. The trained models can be easily exported and reloaded for fast execution and deployment.

API Demo for LSTM on Phenotyping Prediction:

# load pre-processed CMS dataset
from pyhealth.data.expdata_generator import cms as cms_expdata_generator

cur_dataset = cms_expdata_generator(exp_id=exp_id, sel_task='phenotyping')
cur_dataset.get_exp_data()
cur_dataset.load_exp_data()

# initialize the model for training
from pyhealth.models.lstm import LSTM
clf = LSTM(exp_id, task='phenotyping')
clf.fit(cur_dataset.train, cur_dataset.valid)

# load the best model for inference
clf.load_model()
clf.inference(cur_dataset.test)
pred_results = clf.get_results()

# evaluate the model
from pyhealth import evaluation
evaluator = evaluation.__dict__['phenotyping']
r = evaluator(pred_results['hat_y'], pred_results['y'])
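
The `evaluation.__dict__['phenotyping']` lookup in the demo selects an evaluator by task name from a module namespace. The sketch below reproduces that pattern on a stand-in namespace; the two evaluator functions and the `SimpleNamespace` object are hypothetical and are not PyHealth's own evaluators.

```python
from types import SimpleNamespace


def phenotyping(hat_y, y):
    """Hypothetical evaluator: fraction of exact matches."""
    return sum(int(a == b) for a, b in zip(hat_y, y)) / len(y)


def mortality(hat_y, y):
    """Hypothetical evaluator: number of mismatches."""
    return sum(int(a != b) for a, b in zip(hat_y, y))


# Stand-in for a module namespace such as `pyhealth.evaluation` (assumption).
evaluation = SimpleNamespace(phenotyping=phenotyping, mortality=mortality)

task = "phenotyping"
by_dict = evaluation.__dict__[task]      # lookup style used in the demo
by_getattr = getattr(evaluation, task)   # equivalent, more conventional form

score = by_dict([1, 0, 1, 1], [1, 0, 0, 1])
print(by_dict is by_getattr, score)
```

Both lookups return the same function object, so the task name can be kept in a configuration string and resolved at run time.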

Citing PyHealth:

The PyHealth paper is under review at JMLR (machine learning open-source software track). If you use PyHealth in a scientific publication, we would appreciate citations to the following paper:

@article{zhao2020pyhealth,
  author  = {Zhao, Yue and Qiao, Zhi and Xiao, Cao and Glass, Lucas and Hu, Xiyang and Sun, Jimeng},
  title   = {PyHealth: A Python Library for Healthcare AI},
  year    = {2020},
}

or:

Zhao, Y., Qiao, Z., Xiao, C., Glass, L., Hu, X. and Sun, J., 2020. PyHealth: A Python Library for Healthcare AI.


Installation

It is recommended to use pip for installation. Please make sure the latest version is installed, as PyHealth is updated frequently:

pip install pyhealth            # normal install
pip install --upgrade pyhealth  # or update if needed
pip install --pre pyhealth      # or include pre-release version for new features

Alternatively, you can clone the repository and install it from source:

git clone https://github.com/yzhao062/pyhealth.git
cd pyhealth
pip install .

Required Dependencies:

  • Python 3.5, 3.6, or 3.7
  • combo>=0.0.8
  • joblib
  • numpy>=1.13
  • numba>=0.35
  • pandas>=0.24
  • scipy>=0.19.1
  • scikit_learn>=0.20
  • torch
  • xlrd>=1.0.0

Warning 1: PyHealth includes several neural-network-based models, e.g., LSTM, which are implemented in PyTorch. However, PyHealth does NOT install these DL libraries for you, which reduces the risk of interfering with your local copies. If you want to use neural-network-based models, please make sure PyTorch is installed. Similarly, models that depend on xgboost do NOT enforce xgboost installation by default.
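
Since these back ends are not installed automatically, it can help to check for them before running a model. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and is not part of PyHealth.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level import names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the optional back ends mentioned in the warning above.
optional = ["torch", "xgboost"]
missing = missing_packages(optional)
if missing:
    print("Optional packages not installed:", ", ".join(missing))
else:
    print("All optional packages are available.")
```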


API Cheatsheet & Reference

Full API Reference: (https://pyhealth.readthedocs.io/en/latest/pyhealth.html). API cheatsheet for most learning models:

  • fit(X_train, X_valid): Fit a learning model on the training set, using the validation set for model selection.
  • inference(X): Predict on X using the fitted estimator.
  • evaluator(y_hat, y): Model evaluation.

Model load and reload:

  • load_model(): Load the best model so far.
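
The call order implied by the cheatsheet is fit, load_model, inference, and then get_results. The toy class below follows that convention so the flow can be tried without any dataset; `ThresholdModel` and its internals are hypothetical and only imitate the method names shown in this README.

```python
class ThresholdModel:
    """Hypothetical model following the fit / load_model / inference flow."""

    def __init__(self):
        self.best_threshold_ = None   # "checkpoint" chosen on validation data
        self.threshold_ = None        # threshold currently loaded
        self.results_ = None

    @staticmethod
    def _accuracy(threshold, data):
        return sum(int((x > threshold) == bool(y)) for x, y in data) / len(data)

    def fit(self, train, valid):
        # Candidate thresholds come from the training values; the one that
        # scores best on the validation set is remembered.
        candidates = sorted(x for x, _ in train)
        self.best_threshold_ = max(candidates, key=lambda t: self._accuracy(t, valid))
        return self

    def load_model(self):
        self.threshold_ = self.best_threshold_

    def inference(self, data):
        self.results_ = {
            "hat_y": [int(x > self.threshold_) for x, _ in data],
            "y": [y for _, y in data],
        }

    def get_results(self):
        return self.results_


train = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
valid = [(0.2, 0), (0.5, 1), (0.8, 1)]
test = [(0.3, 0), (0.7, 1)]

clf = ThresholdModel()
clf.fit(train, valid)
clf.load_model()
clf.inference(test)
pred_results = clf.get_results()
print(pred_results)
```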

Preprocessed Datasets & Implemented Algorithms

(i) Preprocessed Datasets (customized data preprocessing function is provided in the example folders):

| Type | Abbr | Description | Processed Function | Link |
|------|------|-------------|--------------------|------|
| EHR-ICU | MIMIC III | A relational database containing tables of data relating to patients who stayed within ICU. | \examples\data_generation\dataloader_mimic | https://mimic.physionet.org/gettingstarted/overview/ |
| EHR-ICU | MIMIC_demo | The MIMIC-III demo database is limited to 100 patients and excludes the noteevents table. | \examples\data_generation\dataloader_mimic_demo | https://mimic.physionet.org/gettingstarted/demo/ |
| EHR-Claim | CMS | DE-SynPUF: CMS 2008-2010 Data Entrepreneurs Synthetic Public Use File | \examples\data_generation\dataloader_cms | https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs |

You may download the above datasets from the links. The structure of the generated datasets can be found in the datasets folder:

  • \datasets\cms\x_data\...csv
  • \datasets\cms\y_data\phenotyping.csv
  • \datasets\cms\y_data\mortality.csv

The processed datasets (X, y) should be put in x_data and y_data, respectively, so that they can be properly digested by deep learning models.
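
To make the layout concrete, the sketch below builds the same folder structure in a temporary directory. The file name `episode_0001.csv` and the column names are hypothetical, since the actual file names and columns are not listed above; only the `x_data` / `y_data` split and `phenotyping.csv` come from this README.

```python
import csv
import tempfile
from pathlib import Path

# Temporary root standing in for the repository's `datasets` folder.
root = Path(tempfile.mkdtemp()) / "datasets" / "cms"
x_dir = root / "x_data"
y_dir = root / "y_data"
x_dir.mkdir(parents=True)
y_dir.mkdir(parents=True)

# Hypothetical per-episode feature file; real file names are not shown above.
with open(x_dir / "episode_0001.csv", "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["time_step", "feature_a", "feature_b"])
    writer.writerow([0, 1.5, 0.0])
    writer.writerow([1, 1.7, 1.0])

# Label file for the phenotyping task; its columns are an assumption.
with open(y_dir / "phenotyping.csv", "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["episode_id", "label"])
    writer.writerow(["episode_0001", 1])

layout = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.csv"))
print(layout)
```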

(ii) Machine Learning and Deep Learning Models:

| Type | Abbr | Algorithm | Year | Ref |
|------|------|-----------|------|-----|
| Classical Models | LogisticReg | Logistic Regression | N/A | |
| Classical Models | XGBoost | XGBoost: A scalable tree boosting system | 2016 | 1 |
| Neural Networks | LSTM | Long short-term memory | 1997 | 2 |
| Neural Networks | GRU | Gated recurrent unit | 2014 | 3 |
| Neural Networks | RETAIN | RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism | 2016 | 4 |
| Neural Networks | Dipole | Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks | 2017 | 5 |
| Neural Networks | tLSTM | Patient Subtyping via Time-Aware LSTM Networks | 2017 | 6 |
| Neural Networks | RAIM | RAIM: Recurrent Attentive and Intensive Model of Multimodal Patient Monitoring Data | 2018 | 7 |
| Neural Networks | StageNet | StageNet: Stage-Aware Neural Networks for Health Risk Prediction | 2020 | 8 |

Examples of running ML and DL models can be found below, or directly at \examples\learning_models\

(iii) Evaluation Metrics:

| Type | Abbr | Metric | Method |
|------|------|--------|--------|
| Binary Classification | average_precision_score | Compute micro/macro average precision (AP) from prediction scores | pyhealth.evaluation.xxx.get_avg_results |
| Binary Classification | roc_auc_score | Compute micro/macro ROC AUC score from prediction scores | pyhealth.evaluation.xxx.get_avg_results |
| Binary Classification, Multi Classification | recall, precision, f1 (to be done here) | Get recall, precision, and f1 values | pyhealth.evaluation.xxx.get_predict_results |
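
The difference between micro and macro averaging is easiest to see on a small multi-label example. The following is a pure-Python sketch of the two averaging schemes for ROC AUC; it is not the implementation behind `get_avg_results`, and the helper names are hypothetical. In practice a library routine such as scikit-learn's `roc_auc_score` with `average="micro"` or `average="macro"` would normally be used.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(Y_true, Y_score):
    """Average the per-label AUCs, weighting every label equally."""
    n_labels = len(Y_true[0])
    per_label = [
        roc_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


def micro_auc(Y_true, Y_score):
    """Pool all (sample, label) decisions and compute one global AUC."""
    flat_true = [t for row in Y_true for t in row]
    flat_score = [s for row in Y_score for s in row]
    return roc_auc(flat_true, flat_score)


# Four samples, two labels each (toy data).
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.1, 0.5]]
print(macro_auc(Y_true, Y_score), micro_auc(Y_true, Y_score))
```

On this toy data the per-label AUCs are 1.0 and 0.75, so the macro average is 0.875, while pooling all eight decisions gives a micro AUC of 0.9375.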

(iv) Supported Tasks:

| Type | Abbr | Description | Method |
|------|------|-------------|--------|
| Multi-classification | phenotyping | Predict the diagnosis code of a patient based on other information, e.g., procedures | \examples\data_generation\generate_phenotyping_xxx.py |
| Binary Classification | mortality prediction | Predict whether a patient may pass away during the hospital stay | \examples\data_generation\generate_mortality_xxx.py |
| Regression | ICU stay length pred | Forecast the length of an ICU stay | \examples\data_generation\generate_icu_length_xxx.py |

Quick Start for Data Processing

We propose the idea of a standard template, a formalized schema for healthcare datasets. Ideally, as long as the data is parsed into the template we define, downstream task processing and the use of ML models become easy and standard. In short, it has the following structure: (figure to be added). The dataloaders for different datasets can be found in examples/data_generation. Using "examples/data_generation/dataloader_mimic_demo.py" as an example:

  1. First read in patient, admission, and event tables.

    import os
    from pyhealth.utils.utility import read_csv_to_df

    # read the patient and admission tables of the MIMIC-III demo dataset
    patient_df = read_csv_to_df(os.path.join('data', 'mimic-iii-clinical-database-demo-1.4', 'PATIENTS.csv'))
    admission_df = read_csv_to_df(os.path.join('data', 'mimic-iii-clinical-database-demo-1.4', 'ADMISSIONS.csv'))
    ...
  2. Then invoke the parallel program to parse the tables in n_jobs cores.

    from joblib import Parallel, delayed
    from pyhealth.data.base_mimic import parallel_parse_tables

    # run parallel_parse_tables in n_jobs parallel workers
    all_results = Parallel(n_jobs=n_jobs, max_nbytes=None, verbose=True)(
        delayed(parallel_parse_tables)(
            patient_df=patient_df,
            admission_df=admission_df,
            icu_df=icu_df,
            event_df=event_df,
            event_mapping_df=event_mapping_df,
            duration=duration,
            save_dir=save_dir)
        for i in range(n_jobs))
  3. The processed sequential data will be saved in the prespecified directory.

    import json

    with open(patient_data_loc, 'w') as outfile:
        json.dump(patient_data_list, outfile)
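
The snippet in step 2 does not show how the work is divided among the workers. One common scheme is to let worker i handle every n_jobs-th patient. The sketch below illustrates that idea under stated assumptions: `parse_shard` and the patient IDs are hypothetical, and the standard library's `ThreadPoolExecutor` is used as a stand-in for joblib so the example has no third-party dependency.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_shard(patient_ids, worker_index, n_jobs):
    """Hypothetical worker: handle every n_jobs-th patient, starting at worker_index."""
    shard = patient_ids[worker_index::n_jobs]
    return [{"patient_id": pid, "worker": worker_index} for pid in shard]


patient_ids = [101, 102, 103, 104, 105, 106, 107]
n_jobs = 3

with ThreadPoolExecutor(max_workers=n_jobs) as pool:
    futures = [pool.submit(parse_shard, patient_ids, i, n_jobs) for i in range(n_jobs)]
    all_results = [future.result() for future in futures]

# Flatten the per-worker lists and confirm every patient was parsed exactly once.
parsed = sorted(rec["patient_id"] for shard in all_results for rec in shard)
print(parsed == patient_ids)
```

Strided slicing keeps the shards disjoint and nearly equal in size, so no patient is parsed twice and no worker is left idle.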

The provided examples in PyHealth mainly focus on scanning the data tables in the schema we have and generating episode datasets. For instance, "examples/data_generation/dataloader_mimic_demo.py" demonstrates the basic procedure for processing the MIMIC III demo dataset.

  1. The next step is to generate episode/sequence data for mortality prediction. See "examples/data_generation/generate_mortality_prediction_mimic_demo.py"

    with open(patient_data_loc, 'w') as outfile:
        json.dump(patient_data_list, outfile)

By this step, the dataset has been processed to generate X, y for mortality prediction. Note that the API is similar across most datasets. One may easily replicate this procedure by calling the data generation scripts in \examples\data_generation. You may also modify the parameters in the scripts to generate customized datasets.
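
As a rough picture of what generating X, y from the parsed patient data involves, the sketch below turns nested patient records into one episode per admission with a binary label. The record layout and all field names (`admissions`, `events`, `died_in_hospital`, and so on) are hypothetical; they are not PyHealth's actual template, which is not specified in this README.

```python
import json

# Hypothetical patient records; the field names are illustrative only.
patient_data_list = [
    {
        "patient_id": "p1",
        "admissions": [
            {"admission_id": "a1", "died_in_hospital": False,
             "events": [{"hour": 2, "code": "BP"}, {"hour": 0, "code": "HR"}]},
            {"admission_id": "a2", "died_in_hospital": True,
             "events": [{"hour": 1, "code": "HR"}]},
        ],
    },
    {
        "patient_id": "p2",
        "admissions": [
            {"admission_id": "a3", "died_in_hospital": False, "events": []},
        ],
    },
]

# Round-trip through JSON, mimicking data that was saved to disk and reloaded.
records = json.loads(json.dumps(patient_data_list))

X, y = [], []
for patient in records:
    for admission in patient["admissions"]:
        if not admission["events"]:
            continue  # skip episodes without any recorded events
        # One episode per admission: the time-ordered sequence of event codes.
        ordered = sorted(admission["events"], key=lambda e: e["hour"])
        X.append([event["code"] for event in ordered])
        y.append(int(admission["died_in_hospital"]))

print(X, y)
```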

Preprocessed datasets are also available at \datasets\cms and \datasets\mimic.


Quick Start for Running Predictive Models

"examples/learning_models/lstm_cms_example.py" demonstrates the basic API for using LSTM for phenotyping prediction. Note that the API is consistent or similar across all other algorithms.

If you do not have the preprocessed datasets yet, download the \datasets folder (cms.zip and mimic.zip) from the PyHealth repository, and run \examples\learning_models\extract_data_run_before_learning.py to prepare/unzip the datasets.

  1. Set up the datasets. X and y should be in x_data and y_data, respectively.

    # load pre-processed CMS dataset
    from pyhealth.data.expdata_generator import cms as cms_expdata_generator
    
    cur_dataset = cms_expdata_generator(exp_id=exp_id, sel_task='phenotyping')
    cur_dataset.get_exp_data()
    cur_dataset.load_exp_data()
  2. Initialize an LSTM model. You may set the parameters of the LSTM, e.g., n_epoch, learning_rate, etc.

    # initialize the model for training
    from pyhealth.models.lstm import LSTM
    clf = LSTM(exp_id, task='phenotyping')
    clf.fit(cur_dataset.train, cur_dataset.valid)
  3. Load the best model found during training and predict on the test dataset.

    # load the best model for inference
    clf.load_model()
    clf.inference(cur_dataset.test)
    pred_results = clf.get_results()
  4. Evaluate the model. Multiple metrics are supported.

    # evaluate the model
    from pyhealth import evaluation
    evaluator = evaluation.__dict__['phenotyping']
    r = evaluator(pred_results['hat_y'], pred_results['y'])
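
Step 3 relies on a "best model" being kept during training. A common way to do this is to overwrite a checkpoint file only when the validation loss improves, then reload that file for inference. The sketch below illustrates the idea with the standard library only; the loss values, the `weights` lists, and the file name `best_model.pkl` are hypothetical, and this is not necessarily how PyHealth stores its checkpoints.

```python
import os
import pickle
import tempfile

# Hypothetical per-epoch validation losses and model states (assumptions).
history = [
    {"epoch": 1, "valid_loss": 0.92, "weights": [0.1, 0.2]},
    {"epoch": 2, "valid_loss": 0.61, "weights": [0.3, 0.1]},
    {"epoch": 3, "valid_loss": 0.74, "weights": [0.5, 0.0]},
]

checkpoint_dir = tempfile.mkdtemp()
checkpoint_path = os.path.join(checkpoint_dir, "best_model.pkl")

best_loss = float("inf")
for state in history:
    # Overwrite the checkpoint only when the validation loss improves.
    if state["valid_loss"] < best_loss:
        best_loss = state["valid_loss"]
        with open(checkpoint_path, "wb") as handle:
            pickle.dump(state, handle)

# Later: reload the best state for inference on the test set.
with open(checkpoint_path, "rb") as handle:
    best_state = pickle.load(handle)

print(best_state["epoch"], best_state["valid_loss"])
```

Here the second epoch has the lowest validation loss, so it is the state that gets reloaded even though training continued for one more epoch.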

Algorithm Benchmark

A comparison of the implemented models will be made available later with a benchmark paper. TBA soon :)

Blueprint & Development Plan

The long-term goal of PyHealth is to become a comprehensive healthcare AI toolkit that supports not only EHR data but also images and clinical notes.

  • The support of image datasets and clinical notes
  • The compatibility and the support of OMOP format datasets
  • Model persistence (save, load, and portability)
  • The release of a benchmark paper with PyHealth

Reference


  1. Chen, T. and Guestrin, C., 2016, August. Xgboost: A scalable tree boosting system. In KDD.

  2. Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural computation, 9(8), pp.1735-1780.

  3. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. and Bengio, Y., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

  4. Choi, E., Bahadori, M.T., Sun, J., Kulas, J., Schuetz, A. and Stewart, W., 2016. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems (pp. 3504-3512).

  5. Ma, F., Chitta, R., Zhou, J., You, Q., Sun, T. and Gao, J., 2017, August. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1903-1911).

  6. Baytas, I.M., Xiao, C., Zhang, X., Wang, F., Jain, A.K. and Zhou, J., 2017, August. Patient subtyping via time-aware lstm networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 65-74).

  7. Xu, Y., Biswal, S., Deshpande, S.R., Maher, K.O. and Sun, J., 2018, July. Raim: Recurrent attentive and intensive model of multimodal patient monitoring data. In Proceedings of the 24th ACM SIGKDD international conference on Knowledge Discovery & Data Mining (pp. 2565-2573).

  8. Gao, J., Xiao, C., Wang, Y., Tang, W., Glass, L.M. and Sun, J., 2020, April. StageNet: Stage-Aware Neural Networks for Health Risk Prediction. In Proceedings of The Web Conference 2020 (pp. 530-540).
