
medical-coding-reproducibility's Introduction

⚕️Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study

Official source code repository for the SIGIR 2023 paper Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study

@inproceedings{edinAutomatedMedicalCoding2023,
  address = {Taipei, Taiwan},
  title = {Automated {Medical} {Coding} on {MIMIC}-{III} and {MIMIC}-{IV}: {A} {Critical} {Review} and {Replicability} {Study}},
  isbn = {978-1-4503-9408-6},
  shorttitle = {Automated {Medical} {Coding} on {MIMIC}-{III} and {MIMIC}-{IV}},
  doi = {10.1145/3539618.3591918},
  booktitle = {Proceedings of the 46th {International} {ACM} {SIGIR} {Conference} on {Research} and {Development} in {Information} {Retrieval}},
  publisher = {ACM Press},
  author = {Edin, Joakim and Junge, Alexander and Havtorn, Jakob D. and Borgholt, Lasse and Maistro, Maria and Ruotsalo, Tuukka and Maaløe, Lars},
  year = {2023}
}

Introduction

Automatic medical coding is the task of automatically assigning diagnosis and procedure codes based on discharge summaries from electronic health records. This repository contains the code used in the paper Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study: code for training and evaluating medical coding models, plus new splits for MIMIC-III and the newly released MIMIC-IV. The following models have been implemented:

Model Paper Original Code
CNN Explainable Prediction of Medical Codes from Clinical Text link
Bi-GRU Explainable Prediction of Medical Codes from Clinical Text link
CAML Explainable Prediction of Medical Codes from Clinical Text link
MultiResCNN ICD Coding from Clinical Text Using Multi-Filter Residual Convolutional Neural Network link
LAAT A Label Attention Model for ICD Coding from Clinical Text link
PLM-ICD PLM-ICD: Automatic ICD Coding with Pretrained Language Models link

The splits are found in files/data. The splits are described in the paper.

How to reproduce results

Setup Conda environment

  1. Create a conda environment: conda create -n coding python=3.10
  2. Install the package: pip install -e .

Prepare MIMIC-III

This code has been developed on MIMIC-III v1.4.

  1. Download the MIMIC-III data into your preferred location path/to/mimiciii. Please note that you need to complete the required training to access the data. The training is free, but takes a couple of hours. - link to data access
  2. Open the file src/settings.py
  3. Change the variable DOWNLOAD_DIRECTORY_MIMICIII to the path of your downloaded data path/to/mimiciii (see the sketch after this list)
  4. If you want to use the MIMIC-III full and MIMIC-III 50 splits from Explainable Prediction of Medical Codes from Clinical Text, run python prepare_data/prepare_mimiciii_mullenbach.py
  5. If you want to use MIMIC-III clean from our paper, run python prepare_data/prepare_mimiciii.py
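
For reference, a minimal sketch of what the relevant line in src/settings.py could look like after step 3. The exact layout of the file may differ; the path below is a placeholder.

# src/settings.py (sketch; adjust to match the existing file)
# Directory containing the MIMIC-III v1.4 files, e.g. NOTEEVENTS.csv.gz and DIAGNOSES_ICD.csv.gz
DOWNLOAD_DIRECTORY_MIMICIII = "path/to/mimiciii"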

Prepare MIMIC-IV

This code has been developed on MIMIC-IV v2.2 and MIMIC-IV-NOTE v2.2.

  1. Download MIMIC-IV and MIMIC-IV-NOTE into your preferred locations path/to/mimiciv and path/to/mimiciv-note. Please note that you need to complete the required training to access the data. The training is free, but takes a couple of hours. - link to data access
  2. Open the file src/settings.py
  3. Change the variable DOWNLOAD_DIRECTORY_MIMICIV to the path of your downloaded data path/to/mimiciv
  4. Change the variable DOWNLOAD_DIRECTORY_MIMICIV_NOTE to the path of your downloaded data path/to/mimiciv-note (see the sketch after this list)
  5. Run python prepare_data/prepare_mimiciv.py
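
As with MIMIC-III, a minimal sketch of the two MIMIC-IV variables in src/settings.py referenced in steps 3 and 4. The paths are placeholders.

# src/settings.py (sketch; adjust to match the existing file)
# Directory containing the MIMIC-IV hosp/ and icu/ folders
DOWNLOAD_DIRECTORY_MIMICIV = "path/to/mimiciv"
# Directory containing the MIMIC-IV-NOTE release, e.g. note/discharge.csv.gz
DOWNLOAD_DIRECTORY_MIMICIV_NOTE = "path/to/mimiciv-note"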

Before running experiments

  1. Create a Weights and Biases account. It is possible to run the experiments without wandb.
  2. Download the model checkpoints and unzip them. Please note that these model weights can't be used commercially due to the MIMIC license.
  3. If you want to train PLM-ICD, you need to download RoBERTa-base-PM-M3-Voc, unzip it, and change the model_path parameter in configs/model/plm_icd.yaml and configs/text_transform/huggingface.yaml to the path of the download.
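
Before launching a PLM-ICD run, it can save time to check that the path written into those two configs resolves to a local Hugging Face checkpoint; if the placeholder is left in place, transformers treats the string as a hub repo id and raises an HFValidationError (see the issues below). A minimal sanity check, assuming transformers is installed and the path points at the unzipped RoBERTa-base-PM-M3-Voc-hf directory:

from pathlib import Path
from transformers import AutoTokenizer

# Replace with the absolute path used in configs/model/plm_icd.yaml and configs/text_transform/huggingface.yaml
model_path = Path("/abs/path/to/RoBERTa-base-PM-M3-Voc/RoBERTa-base-PM-M3-Voc-hf")

assert model_path.is_dir(), f"{model_path} does not exist"
# Fails fast if vocab.json, merges.txt, or config.json is missing from the directory
tokenizer = AutoTokenizer.from_pretrained(str(model_path))
print("Loaded tokenizer with vocabulary size", tokenizer.vocab_size)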

Running experiments

Training

You can run any experiment found in configs/experiment. Here are some examples:

  • Train PLM-ICD on MIMIC-III clean on GPU 0: python main.py experiment=mimiciii_clean/plm_icd gpu=0
  • Train CAML on MIMIC-III full on GPU 6: python main.py experiment=mimiciii_full/caml gpu=6
  • Train LAAT on MIMIC-IV ICD-9 full on GPU 6: python main.py experiment=mimiciv_icd9/laat gpu=6
  • Train LAAT on MIMIC-IV ICD-9 full on GPU 6 without weights and biases: python main.py experiment=mimiciv_icd9/laat gpu=6 callbacks=no_wandb trainer.print_metrics=true

Evaluation

If you just want to evaluate the models using the provided model checkpoints, set trainer.epochs=0 and provide the path to the model checkpoint with load_model=path/to/model_checkpoint. Make sure you use the correct model checkpoint with the matching experiment config.

Example: Evaluate PLM-ICD on MIMIC-IV ICD-10 on GPU 1: python main.py experiment=mimiciv_icd10/plm_icd gpu=1 load_model=path/to/model_checkpoints/mimiciv_icd10/plm_icd trainer.epochs=0

Overview of the repository

configs

We use Hydra for configuration. The config for every experiment is found in configs/experiment. Furthermore, the configurations for the sweeps are found in configs/sweeps. We used Weights and Biases Sweeps for most of our experiments.

files

This is where the images and data are stored.

notebooks

The directory contains a single notebook used for the code analysis. The notebook is not intended for general use, but is included so that others can validate our data analysis.

prepare_data

The directory contains all the code for preparing the datasets and generating splits.

reports

This is the code used to generate the plots and tables in the paper. It uses the Weights and Biases API to fetch the experiment results. The code is not directly usable by others, but is included so that our figures and tables can be validated.

src

This is where the code for running the experiments is found.

tests

The directory contains the unit tests.

My setup

I ran each experiment on a single RTX 2080 Ti (11 GB). My machine had 128 GB of RAM.

⚠️ Known issues

  • LAAT and PLM-ICD are unstable: the loss will sometimes diverge during training. The issue appears to be overflow in the softmax function in the label-wise attention. Using batch norm or layer norm before the softmax might solve it (see the sketch after this list). We did not try to fix the issue, as we did not want to change the original methods during our reproducibility study.
  • The code was only tested on a server with 128 GB RAM. A user with 32 GB RAM reported issues fitting MIMIC-IV into memory.
  • There is an error in the collate function in the Huggingface dataset: the attention mask is padded with 1s instead of 0s. I have not fixed this issue because I want people to be able to reproduce the results from the paper.
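
For the divergence issue above, the suggested mitigation would look roughly like the sketch below: a label-wise attention layer with a LayerNorm applied to the attention logits before the softmax. This is an illustration of the idea only, not the repository's implementation; the class and parameter names are made up.

import torch
import torch.nn as nn

class LabelWiseAttention(nn.Module):
    """Hypothetical label-wise attention with LayerNorm before the softmax."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.label_queries = nn.Linear(hidden_size, num_labels, bias=False)
        # Normalising the logits bounds their scale, which can prevent
        # the overflow that makes the softmax saturate or diverge.
        self.norm = nn.LayerNorm(num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        scores = self.norm(self.label_queries(hidden_states))  # (batch, seq_len, num_labels)
        attention = torch.softmax(scores, dim=1)                # attend over the sequence
        # Label-specific document representations: (batch, num_labels, hidden_size)
        return attention.transpose(1, 2) @ hidden_states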

Acknowledgement

Thank you Sotiris Lamprinidis for providing an efficient implementation of our multi-label stratification algorithm and some data preprocessing helper functions.


medical-coding-reproducibility's Issues

Correct the sample command

Hello,
Thank you for sharing your code.
In the Evaluation section of the README, epochs=0 needs to be changed to trainer.epochs=0.

100% Memory usage crash during MIMIC-IV preprocessing

Hi thank you for open-sourcing your project.

I'm trying to run the MIMIC-IV preprocessing, python prepare_data/prepare_mimiciv.py (step 5 of the Prepare MIMIC-IV section in README.md), but after 2+ hours it seems to crash, infinitely spawning new processes and consuming 100% of my machine's RAM.

Unfortunately I was not able to capture the stack trace, as I had to power off my machine after it crashed, but the errors seemed to be related to vaex and multiprocessing. I have tried this twice now, each time cloning the repo from scratch, with the same outcome.

If it helps, I am on Windows 10 running Python 3.10 with 32 GB RAM and a 16-core CPU.

MIMIC-IV notes

Hello,

I am wondering how you concatenate the notes and ICD-10 codes in the MIMIC-IV dataset, since the MIMIC-IV-Note dataset only contains the subject ID and notes, and I do not see the related variable in the MIMIC-IV dataset. Could you please clarify this for me?
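
For context, the discharge notes in MIMIC-IV-NOTE (note/discharge.csv.gz) also contain a hadm_id column, which is what allows them to be linked to the ICD codes in hosp/diagnoses_icd.csv.gz. A minimal pandas sketch of such a join; the paths are placeholders, and this is not the repository's prepare_mimiciv.py:

import pandas as pd

notes = pd.read_csv("path/to/mimiciv-note/note/discharge.csv.gz",
                    usecols=["subject_id", "hadm_id", "text"])
codes = pd.read_csv("path/to/mimiciv/hosp/diagnoses_icd.csv.gz",
                    usecols=["hadm_id", "icd_code", "icd_version"])

# Collect all diagnosis codes per admission, then attach them to the note text
codes_per_stay = (codes.groupby("hadm_id")["icd_code"]
                  .apply(list)
                  .reset_index(name="icd_codes"))
notes_with_codes = notes.merge(codes_per_stay, on="hadm_id", how="inner")
print(notes_with_codes[["hadm_id", "icd_codes"]].head())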

On the preprocessing of the notes

Hi, love what you're doing here.

wanted to ask two questions:

1- Is there a particular reason for choosing this specific preprocessing?

MIN_TARGET_COUNT = 10  # Minimum number of times a code must appear to be included
preprocessor = TextPreprocessor(
    lower=True,
    remove_special_characters_mullenbach=True,
    remove_special_characters=False,
    remove_digits=True,
    remove_accents=False,
    remove_brackets=False,
    convert_danish_characters=False,
)

2- I also wanted to ask about the stratification you used. I'm working on a sequential disease prediction task (predicting the next visit's diagnoses from previous visits), so the targets are the next visit's ICD codes, and I wanted your thoughts on whether the stratification you used could be adapted to what I'm doing.

Inconsistent reproduction of study results of PLM-ICD on MIMIC-III-full and MIMIC-III-50.

I successfully replicated the PLM-ICD study results on three datasets: MIMIC-III-clean, MIMIC-IV-ICD9, and MIMIC-IV-ICD10. However, when executing the experiment scripts below, I found inconsistencies in the results for the following experiments:

For the full MIMIC-III dataset (mimiciii_full/plm_icd):

  • Command: python main.py experiment=mimiciii_full/plm_icd.yaml gpu=0 callbacks=no_wandb
  • Evaluation command: python main.py experiment=mimiciii_full/plm_icd.yaml gpu=0 callbacks=no_wandb trainer.epochs=0 trainer.print_metrics=true

For the MIMIC-III-50 dataset (mimiciii_50/plm_icd):

  • Command: python main.py experiment=mimiciii_50/plm_icd.yaml gpu=0 callbacks=no_wandb
  • Evaluation command: python main.py experiment=mimiciii_50/plm_icd.yaml gpu=0 callbacks=no_wandb trainer.print_metrics=true trainer.epochs=0

Following the training and evaluation phases, the obtained results were as follows:

For mimiciii_full:

{
│   'all': {
│   │   'f1_micro': tensor(0.0088),
│   │   'f1_macro': tensor(0.0083),
│   │   'recall_micro': tensor(1.),
│   │   'recall_macro': tensor(1.),
│   │   'precision_micro': tensor(0.0044),
│   │   'precision_macro': tensor(0.0044),
│   │   'fpr_micro': tensor(1.),
│   │   'fpr_macro': tensor(1.),
│   │   'exact_match_ratio': tensor(0., device='cuda:0'),
│   │   'precision@5': tensor(0.0081),
│   │   'precision@8': tensor(0.0074),
│   │   'precision@15': tensor(0.0062),
│   │   'recall@10': tensor(0.0037),
│   │   'recall@15': tensor(0.0050),
│   │   'map': tensor(0.0070),
│   │   'precision@recall': tensor(0.0059),
│   │   'loss': tensor(0.0275, device='cuda:0'),
│   │   'auc_micro': 0.5086265528944932,
│   │   'auc_macro': 0.5915369966061897
│   },
│   'icd9_diag': {
│   │   'f1_micro': tensor(0.0090),
│   │   'f1_macro': tensor(0.0085),
│   │   'recall_micro': tensor(1.),
│   │   'recall_macro': tensor(1.),
│   │   'precision_micro': tensor(0.0045),
│   │   'precision_macro': tensor(0.0045),
│   │   'fpr_micro': tensor(1.),
│   │   'fpr_macro': tensor(1.),
│   │   'exact_match_ratio': tensor(0., device='cuda:0'),
│   │   'precision@5': tensor(0.0094),
│   │   'precision@8': tensor(0.0078),
│   │   'precision@15': tensor(0.0064),
│   │   'recall@10': tensor(0.0051),
│   │   'recall@15': tensor(0.0065),
│   │   'map': tensor(0.0078),
│   │   'precision@recall': tensor(0.0065),
│   │   'loss': tensor(0.0275, device='cuda:0'),
│   │   'auc_micro': 0.5086265528944932,
│   │   'auc_macro': 0.5915369966061897
│   },
│   'icd9_proc': {
│   │   'f1_micro': tensor(0.0094),
│   │   'f1_macro': tensor(0.0089),
│   │   'recall_micro': tensor(1.),
│   │   'recall_macro': tensor(1.),
│   │   'precision_micro': tensor(0.0047),
│   │   'precision_macro': tensor(0.0047),
│   │   'fpr_micro': tensor(1.),
│   │   'fpr_macro': tensor(1.),
│   │   'exact_match_ratio': tensor(0., device='cuda:0'),
│   │   'precision@5': tensor(0.0040),
│   │   'precision@8': tensor(0.0045),
│   │   'precision@15': tensor(0.0048),
│   │   'recall@10': tensor(0.0097),
│   │   'recall@15': tensor(0.0165),
│   │   'map': tensor(0.0122),
│   │   'precision@recall': tensor(0.0043),
│   │   'loss': tensor(0.0275, device='cuda:0'),
│   │   'auc_micro': 0.5086265528944932,
│   │   'auc_macro': 0.5915369966061897
│   }
}

For mimiciii_50:

{
│   'all': {
│   │   'f1_micro': tensor(0.2162),
│   │   'f1_macro': tensor(0.2083),
│   │   'recall_micro': tensor(1.),
│   │   'recall_macro': tensor(1.),
│   │   'precision_micro': tensor(0.1212),
│   │   'precision_macro': tensor(0.1212),
│   │   'fpr_micro': tensor(1.),
│   │   'fpr_macro': tensor(1.),
│   │   'exact_match_ratio': tensor(0., device='cuda:0'),
│   │   'precision@5': tensor(0.1131),
│   │   'precision@8': tensor(0.1126),
│   │   'precision@15': tensor(0.1156),
│   │   'recall@10': tensor(0.1919),
│   │   'recall@15': tensor(0.2884),
│   │   'map': tensor(0.1749),
│   │   'precision@recall': tensor(0.1123),
│   │   'loss': tensor(1.1453, device='cuda:0'),
│   │   'auc_micro': 0.5560300278281775,
│   │   'auc_macro': 0.6060359934689991
│   },
│   'icd9_diag': {
│   │   'f1_micro': tensor(0.2464),
│   │   'f1_macro': tensor(0.2373),
│   │   'recall_micro': tensor(1.),
│   │   'recall_macro': tensor(1.),
│   │   'precision_micro': tensor(0.1405),
│   │   'precision_macro': tensor(0.1405),
│   │   'fpr_micro': tensor(1.),
│   │   'fpr_macro': tensor(1.),
│   │   'exact_match_ratio': tensor(0., device='cuda:0'),
│   │   'precision@5': tensor(0.1216),
│   │   'precision@8': tensor(0.1217),
│   │   'precision@15': tensor(0.1233),
│   │   'recall@10': tensor(0.2696),
│   │   'recall@15': tensor(0.4098),
│   │   'map': tensor(0.2068),
│   │   'precision@recall': tensor(0.1274),
│   │   'loss': tensor(1.1453, device='cuda:0'),
│   │   'auc_micro': 0.5560300278281775,
│   │   'auc_macro': 0.6060359934689991
│   },
│   'icd9_proc': {
│   │   'f1_micro': tensor(0.2473),
│   │   'f1_macro': tensor(0.2392),
│   │   'recall_micro': tensor(1.),
│   │   'recall_macro': tensor(1.),
│   │   'precision_micro': tensor(0.1411),
│   │   'precision_macro': tensor(0.1411),
│   │   'fpr_micro': tensor(1.),
│   │   'fpr_macro': tensor(1.),
│   │   'exact_match_ratio': tensor(0., device='cuda:0'),
│   │   'precision@5': tensor(0.1636),
│   │   'precision@8': tensor(0.1578),
│   │   'precision@15': tensor(0.1435),
│   │   'recall@10': tensor(0.5971),
│   │   'recall@15': tensor(0.8803),
│   │   'map': tensor(0.2809),
│   │   'precision@recall': tensor(0.1585),
│   │   'loss': tensor(1.1453, device='cuda:0'),
│   │   'auc_micro': 0.5560300278281775,
│   │   'auc_macro': 0.6060359934689991
│   }
}

huggingface_hub.utils._validators.HFValidationError

I am trying to run PYTHONPATH=. python main.py experiment=mimiciv_icd10/plm_icd gpu=0 but I get the following error

  File "/opt/conda/envs/coding/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/envs/coding/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/opt/conda/envs/coding/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/opt/conda/envs/coding/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/abc/medical-coding-reproducibility/main.py", line 76, in main
    text_transform = get_transform(
  File "/home/abc/medical-coding-reproducibility/src/factories.py", line 108, in get_transform
    transform_class = getattr(transform, config.name)(**config.configs)
  File "/home/abc/medical-coding-reproducibility/src/data/transform.py", line 234, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(model_path, **kwargs)
  File "/opt/conda/envs/coding/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 598, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/opt/conda/envs/coding/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 442, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/opt/conda/envs/coding/lib/python3.10/site-packages/transformers/utils/hub.py", line 409, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/conda/envs/coding/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    validate_repo_id(arg_value)
  File "/opt/conda/envs/coding/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 166, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '<path/to/RoBERTa-base-PM-M3-Voc-hf>'. Use `repo_type` argument if needed.

My hunch is that there must be a problem with my config file, which looks like

(coding) abc@embeddings:~/medical-coding-reproducibility$ ls RoBERTa-base-PM-M3-Voc/RoBERTa-base-PM-M3-Voc-hf/
config.json  merges.txt  pytorch_model.bin  vocab.json
(coding) abc@embeddings:~/medical-coding-reproducibility$ cat configs/model/plm_icd.yaml
name: PLMICD
configs:
  model_path: /home/abc/medical-coding-reproducibility/RoBERTa-base-PM-M3-Voc/RoBERTa-base-PM-M3-Voc-hf

Any idea what is wrong here?

Setup error on Win10

Hi, I'm getting a setup error on Windows when executing step 2 from the readme.
I think the command should be pip install -e . but the instruction says pip install . -e, which is not a valid pip command as far as I know.

Here's the trace from within my Anaconda terminal: [screenshot]

Edit: updated to a screenshot as the text was malformatted.

huggingface_hub.utils._validators.HFValidationError while evaluating the mimiciii_clean/plm_icd model.

Description

I am attempting to evaluate the "mimiciii_clean/plm_icd" model on the "mimiciii_clean" dataset following these steps:

  1. Dataset Preparation: I prepared the dataset by running the "prepare_data/prepare_mimiciii_clean.py" script on the 'mimiciii' directory.

  2. Checkpoint Download: I downloaded the necessary checkpoints from the link provided in the README.md file. Here's a screenshot confirming the presence of these checkpoints for the "plm_icd" model:
    Checkpoint Screenshot

  3. Execution Command: I executed the following command:

    python main.py experiment=mimiciii_clean/plm_icd gpu=0 load_model=/content/files/model_checkpoints/mimiciii_clean/plm-icd +epochs=0 callbacks=no_wandb
    

Error Encountered

After running the command, I encountered the following error:

[2023-08-31 05:28:20,075][infotropy.utils.random][INFO] - Set 'numpy', 'random' and 'torch' random seed to 1337
'Device: cuda'
'CUDA_VISIBLE_DEVICES: 0'
loaded transform
Error executing job with overrides: ['experiment=mimiciii_clean/plm_icd', 'gpu=0', 'load_model=/content/files/model_checkpoints/mimiciii_clean/plm-icd', '+epochs=0', 'callbacks=no_wandb']
Traceback (most recent call last):
  File "/content/main.py", line 150, in <module>
    main()  # pylint: disable=no-value-for-parameter
  File "/usr/local/lib/python3.10/dist-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/content/main.py", line 76, in main
    text_transform = get_transform(
  File "/content/src/factories.py", line 108, in get_transform
    transform_class = getattr(transform, config.name)(**config.configs)
  File "/content/src/data/transform.py", line 234, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(model_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 677, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 510, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 428, in cached_file
    resolved_file = hf_hub_download(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '<path/to/RoBERTa-base-PM-M3-Voc-hf>'. Use `repo_type` argument if needed.

Could someone please verify the correctness of my steps and provide guidance on resolving this error? Your assistance would be greatly appreciated.

`attention_mask` padded with 1

Hi, thanks for sharing :-) I find this very helpful.

A question I have is regarding how you pad the attention_mask when collating data.

attention_mask = self.text_transform.seq2batch(
    attention_mask, chunk_size=self.chunk_size
)

Since the seq2batch function pads with self.tokenizer.pad_token_id, the attention_mask variable is padded with 1. Shouldn't it be padded with 0, since padded tokens should not be attended to?

Thanks,
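
For anyone who wants to deviate from the paper and fix this locally, the change amounts to padding the attention mask with 0 instead of the tokenizer's pad token id. A minimal illustrative sketch; the function below is made up and is not the repository's seq2batch:

import torch

def pad_to_chunks(values: torch.Tensor, chunk_size: int, pad_value: int) -> torch.Tensor:
    """Pad a 1D tensor to a multiple of chunk_size, then reshape it into chunks."""
    remainder = values.size(0) % chunk_size
    if remainder:
        padding = torch.full((chunk_size - remainder,), pad_value, dtype=values.dtype)
        values = torch.cat([values, padding])
    return values.view(-1, chunk_size)

# Token ids are padded with the tokenizer's pad token id ...
# input_ids = pad_to_chunks(input_ids, chunk_size, tokenizer.pad_token_id)
# ... but the attention mask should be padded with 0 so padded positions are masked out:
# attention_mask = pad_to_chunks(attention_mask, chunk_size, pad_value=0)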

OSError: Read-only File System Error While Running prepare_mimiciii_mullenbach.py

I encountered an error while trying to run the prepare_data/prepare_mimiciii_mullenbach.py script. The error message I received is as follows:
OSError: [Errno 30] Read-only file system: '/content/MIMIC_Dataset/mimic-iii-clinical-database-1.4/NOTEEVENTS.feather'

Directory Structure:
Here is the directory structure that contains both the MIMIC-III and MIMIC-IV datasets:

├── 📁 MIMIC_Dataset
│ ├── 📁 mimic-iv-2.2
│ │ ├── 📄 CHANGELOG.txt
│ │ ├── 📄 LICENSE.txt
│ │ ├── 📄 SHA256SUMS.txt
│ │ ├── 📁 hosp
│ │ │ ├── 📄 omr.csv.gz
│ │ │ ├── 📄 microbiologyevents.csv.gz
│ │ │ ├── 📄 d_icd_procedures.csv.gz
│ │ │ ├── 📄 admissions.csv.gz
│ │ │ ├── 📄 d_labitems.csv.gz
│ │ │ ├── 📄 drgcodes.csv.gz
│ │ │ ├── 📄 patients.csv.gz
│ │ │ ├── 📄 hcpcsevents.csv.gz
│ │ │ ├── 📄 d_hcpcs.csv.gz
│ │ │ ├── 📄 d_icd_diagnoses.csv.gz
│ │ │ ├── 📄 diagnoses_icd.csv.gz
│ │ │ ├── 📄 provider.csv.gz
│ │ │ ├── 📄 transfers.csv.gz
│ │ │ ├── 📄 pharmacy.csv.gz
│ │ │ ├── 📄 services.csv.gz
│ │ │ ├── 📄 procedures_icd.csv.gz
│ │ │ ├── 📄 prescriptions.csv.gz
│ │ │ ├── 📄 emar.csv.gz
│ │ │ ├── 📄 emar_detail.csv.gz
│ │ │ ├── 📄 poe.csv.gz
│ │ │ ├── 📄 poe_detail.csv.gz
│ │ │ └── 📄 labevents.csv.gz
│ │ └── 📁 icu
│ │ ├── 📄 caregiver.csv.gz
│ │ ├── 📄 d_items.csv.gz
│ │ ├── 📄 datetimeevents.csv.gz
│ │ ├── 📄 procedureevents.csv.gz
│ │ ├── 📄 outputevents.csv.gz
│ │ ├── 📄 chartevents.csv.gz
│ │ ├── 📄 ingredientevents.csv.gz
│ │ ├── 📄 inputevents.csv.gz
│ │ └── 📄 icustays.csv.gz
│ └── 📁 mimic-iii-clinical-database-1.4
│ ├── 📄 ADMISSIONS.csv.gz
│ ├── 📄 CAREGIVERS.csv.gz
│ ├── 📄 CALLOUT.csv.gz
│ ├── 📄 D_ICD_DIAGNOSES.csv.gz
│ ├── 📄 checksum_md5_zipped.txt
│ ├── 📄 D_CPT.csv.gz
│ ├── 📄 D_LABITEMS.csv.gz
│ ├── 📄 checksum_md5_unzipped.txt
│ ├── 📄 D_ITEMS.csv.gz
│ ├── 📄 DIAGNOSES_ICD.csv.gz
│ ├── 📄 D_ICD_PROCEDURES.csv.gz
│ ├── 📄 DRGCODES.csv.gz
│ ├── 📄 DATETIMEEVENTS.csv.gz
│ ├── 📄 ICUSTAYS.csv.gz
│ ├── 📄 CPTEVENTS.csv.gz
│ ├── 📄 CHARTEVENTS.csv.gz
│ ├── 📄 INPUTEVENTS_MV.csv.gz
│ ├── 📄 INPUTEVENTS_CV.csv.gz
│ ├── 📄 MICROBIOLOGYEVENTS.csv.gz
│ ├── 📄 LABEVENTS.csv.gz
│ ├── 📄 OUTPUTEVENTS.csv.gz
│ ├── 📄 PATIENTS.csv.gz
│ ├── 📄 NOTEEVENTS.csv.gz
│ ├── 📄 SERVICES.csv.gz
│ ├── 📄 PROCEDUREEVENTS_MV.csv.gz
│ ├── 📄 README.md
│ ├── 📄 PROCEDURES_ICD.csv.gz
│ ├── 📄 TRANSFERS.csv.gz
│ ├── 📄 PRESCRIPTIONS.csv.gz
│ ├── 📄 LICENSE.txt
│ └── 📄 SHA256SUMS.txt

Questions:

  1. I'm not sure if my directory structure is appropriate for running the code. Can you please review it and confirm whether it's set up correctly?

  2. The "README.md" also mentions (in the "Prepare MIMIC-IV" section) changing the variable DOWNLOAD_DIRECTORY_MIMICIV_NOTE to the path of the downloaded data. However, I'm unsure where to find this path in my MIMIC-IV dataset. Could you provide guidance on where to locate this path?
