
cancerrisknet's Introduction

Pancreatic cancer risk predicted from disease trajectories using deep learning

Introduction

This repository contains the code used for the paper Pancreatic cancer risk predicted from disease trajectories using deep learning. We used disease trajectories from electronic health records (EHR) to calculate the risk of developing pancreatic cancer at different time intervals after the assessment. The repository supports several deep learning models.

Usage

Data structure

The data folder contains a synthetic dataset with the same structure as the data used in the study. To reproduce the results on another dataset, generate a JSON file with the same structure. The events do not need to be sorted by 'admdate'.

{
    "PID_0":{
        "birtdate":"1900-01-01",
        "end_of_data":"2022-01-01",
        "events":[
            {
                "admdate":"2000-01-01",
                "admid":"00000000",
                "codes":"E10"
            },
            {
                "admdate":"1990-01-01",
                "admid":"00000001",
                "codes":"C25"
            }
        ]
    },



    "PID_9999":{
        "birtdate":"1900-01-01",
        "end_of_data":"2022-01-01",
        "events":[
            {
                "admdate":"1980-01-01",
                "admid":"00000002",
                "codes":"K54"
            },
            {
                "admdate":"2010",
                "admid":"00000004",
                "codes":"C10"
            }
        ]
    }
}
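As a rough illustration of the structure above, the following sketch checks that every patient and event carries the keys shown in the sample, and orders events by 'admdate'. The helper names `check_patients` and `sorted_events` are hypothetical and not part of the repository. The key names are copied verbatim from the sample, and the sort relies on full `YYYY-MM-DD` dates, which order correctly as plain strings.

```python
import json

# Key names are copied from the sample above; the helpers are hypothetical.
PATIENT_KEYS = ("birtdate", "end_of_data", "events")
EVENT_KEYS = ("admdate", "admid", "codes")


def check_patients(patients):
    """Return a list of human-readable problems found in the patient dict."""
    problems = []
    for pid, record in patients.items():
        for key in PATIENT_KEYS:
            if key not in record:
                problems.append(f"{pid}: missing '{key}'")
        for i, event in enumerate(record.get("events", [])):
            for key in EVENT_KEYS:
                if key not in event:
                    problems.append(f"{pid}: event {i} missing '{key}'")
    return problems


def sorted_events(record):
    """Return the events ordered by 'admdate' (ISO dates sort as strings)."""
    return sorted(record["events"], key=lambda event: event["admdate"])


sample = json.loads("""
{
    "PID_0": {
        "birtdate": "1900-01-01",
        "end_of_data": "2022-01-01",
        "events": [
            {"admdate": "2000-01-01", "admid": "00000000", "codes": "E10"},
            {"admdate": "1990-01-01", "admid": "00000001", "codes": "C25"}
        ]
    }
}
""")

print(check_patients(sample))  # []
print([e["admid"] for e in sorted_events(sample["PID_0"])])  # ['00000001', '00000000']
```

A check like this can catch structural problems in a custom dataset early. It does not replace the Step 1 script described below.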

STEP 1

To run any experiment, a config file needs to be saved under the configs folder. Check the possible arguments in the parsing file. For example:

{
  "search_space": {
    "metadata":["data/sample_diag_data.json"],
    "train":[true],
    "dev":[true],
    "test":[false],
    "epochs":[2],
    "num_workers":[8],
    "cuda":[true],
    "hidden_dim":[256],
    "model_name": ["transformer", "gru", "bow"],
    "optimizer": ["adam"],
    "init_lr":[1e-03, 1e-04],
    "train_batch_size": [4],
    "eval_batch_size": [4],
    "max_batches_per_train_epoch": [5],
    "max_batches_per_dev_epoch": [5],
    "max_events_length": [1000],
    "max_eval_indices": [10],
    "pad_size": [200],
    "eval_auroc": [true],
    "eval_auprc": [true],
    "eval_c_index": [true],
    "data_setting_path": ["data/settings_sample_data.yaml"]
  },
  "available_gpus": [1]
}
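Every entry in search_space is a list of candidate values. Assuming the scheduler runs one experiment per combination of those values (the usual meaning of a grid search, not confirmed here against the repository code), the sample above yields 3 models × 2 learning rates = 6 experiments. The sketch below shows that expansion; `expand_search_space` is a hypothetical helper, and only three of the keys are included to keep it short.

```python
import itertools
import json

# A trimmed copy of the sample config; the remaining keys behave the same way.
config = json.loads("""
{
  "search_space": {
    "model_name": ["transformer", "gru", "bow"],
    "init_lr": [1e-03, 1e-04],
    "hidden_dim": [256]
  },
  "available_gpus": [1]
}
""")


def expand_search_space(search_space):
    """Yield one flat parameter dict per combination of the listed values."""
    keys = list(search_space)
    for values in itertools.product(*(search_space[key] for key in keys)):
        yield dict(zip(keys, values))


experiments = list(expand_search_space(config["search_space"]))
print(len(experiments))  # 6
print(experiments[0])    # {'model_name': 'transformer', 'init_lr': 0.001, 'hidden_dim': 256}
```

Since the number of experiments is the product of the list lengths, adding a second value to any single-valued key doubles the size of the search.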

In the data folder, a YAML file must be saved with the following information:
- PANC_CANCER_CODE [required]: used to define the outcome
- END_OF_TIME_DATE [required]: date of the last possible record in the data
- KNOWN_RISK_FACTORS [optional]: used to run experiments using only the known risk factors
- ICD8_MAPPER_NAME, ICD9_MAPPER_NAME, ICD10_MAPPER_NAME [optional]: files used in the visualization to translate disease codes into plain text
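The required and optional keys can be checked with a few lines of Python once the YAML file has been loaded into a dict (for instance with PyYAML's `yaml.safe_load`). This is a minimal sketch: `check_settings` is a hypothetical helper, and the values below are placeholders, since the value types the repository expects are not described in this README.

```python
REQUIRED_KEYS = ("PANC_CANCER_CODE", "END_OF_TIME_DATE")
OPTIONAL_KEYS = (
    "KNOWN_RISK_FACTORS",
    "ICD8_MAPPER_NAME",
    "ICD9_MAPPER_NAME",
    "ICD10_MAPPER_NAME",
)


def check_settings(settings):
    """Return (missing required keys, keys not listed in the README)."""
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    known = set(REQUIRED_KEYS) | set(OPTIONAL_KEYS)
    unknown = sorted(key for key in settings if key not in known)
    return missing, unknown


# Placeholder values for illustration only (assumed, not taken from the repo).
settings = {"PANC_CANCER_CODE": "C25", "END_OF_TIME_DATE": "2022-01-01"}

print(check_settings(settings))                    # ([], [])
print(check_settings({"KNOWN_RISK_FACTORS": []}))  # (['PANC_CANCER_CODE', 'END_OF_TIME_DATE'], [])
```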

Before running any experiment, a few checks on the requirements must pass and the vocabulary for the models must be constructed. Both are done by running the following command:

python scripts/Step1-CheckFiles.py --experiment_config_path configs/sample_grid_search.json

Resolve any errors and warnings raised by the script before proceeding to the next step.

STEP 2

The command used to run experiments is:

python scripts/Step2-ModelTrainScheduler.py --experiment_config_path configs/sample_grid_search.json --search_name sample_search --scheduler single_node_scheduler

The --scheduler single_node_scheduler option (default) runs the experiments sequentially on the machine where the script is launched. Initial support is also available for Torque/Moab and Google Cloud (see --help).

STEP 3

The end of the Step2 output shows the command you will most likely want to run to collect the results of the grid search you just ran:

python -u scripts/Step3-CollectSearchResults.py \
    --experiment_config_path configs/sample_grid_search.json \
    --search_dir searches/untitled-search_d94b4902_20220210-2247 \
    --result_dir results/untitled-search_d94b4902_20220210-2247

Note that the folder name inside searches and results (untitled-search_d94b4902_20220210-2247 in the example) is generated from the --search_name argument used in Step2 (untitled-search is the default), a short MD5 hash derived from the parameters of that specific grid search, and the datetime. This way, each hyperparameter search gets its own folder name.
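The naming scheme can be sketched as follows. This is not the repository's implementation: `search_folder_name` is a hypothetical helper, and the way the parameters are serialised before hashing (JSON with sorted keys), the 8-character hash length, and the `%Y%m%d-%H%M` timestamp format are assumptions inferred from the example folder name.

```python
import hashlib
import json
from datetime import datetime


def search_folder_name(search_name, search_space, now=None):
    """Build '<search_name>_<short md5>_<YYYYmmdd-HHMM>' for one grid search."""
    now = now or datetime.now()
    # Serialise with sorted keys so identical parameters always hash the same.
    payload = json.dumps(search_space, sort_keys=True).encode("utf-8")
    short_hash = hashlib.md5(payload).hexdigest()[:8]
    return f"{search_name}_{short_hash}_{now:%Y%m%d-%H%M}"


fixed_time = datetime(2022, 2, 10, 22, 47)
name_a = search_folder_name("sample_search", {"epochs": [2]}, fixed_time)
name_b = search_folder_name("sample_search", {"epochs": [3]}, fixed_time)

print(name_a)
print(name_a != name_b)  # True: different parameters give a different hash
```

Under these assumptions, the hash separates searches with different parameters launched in the same minute, while the timestamp separates repeated runs of the same search.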

If the collection returns a warning (e.g. a missing test evaluation because the --test flag was not used in the search config), it is possible to rerun Step2 with the --resume_from_result parameter in the search config, which reloads the model and arguments and runs --train / --dev / --test according to the new config.

STEP 4

The final step is to generate a table with all the metrics (including confidence intervals) by running the command below.

python scripts/Step4-ResultBootstrap.py --search_metadata searches/untitled-search_d94b4902_20220210-2247/performance_table.csv

The file performance_table.csv is the output generated in Step3 and lists all the experiments that Step4 uses to generate the new table with metrics, confidence intervals, and curve coordinates. Use the --filename flag to specify the name of the Step4 output.
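The script name suggests that the confidence intervals come from bootstrapping, but the exact resampling scheme is not described here. As a generic illustration, the sketch below computes a percentile bootstrap interval for AUROC in plain Python. The functions `auroc` and `bootstrap_ci` and the toy data are hypothetical and not taken from the repository.

```python
import random


def auroc(labels, scores):
    """AUROC as the chance that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when one class is absent
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample cases with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = auroc([labels[i] for i in idx], [scores[i] for i in idx])
        if value is not None:  # skip resamples that contain a single class
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


# Toy data for illustration only.
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.65, 0.9]

print(auroc(labels, scores))  # 0.875
print(bootstrap_ci(labels, scores))
```

The same resampling loop works for any metric that can be computed from a list of cases, which is why a single bootstrap pass can produce intervals for several metrics at once.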

Contributors

daplaci
