
caml-mimic's People

Contributors

jamesmullenbach, sarahwie


caml-mimic's Issues

Input and output of convolutional layer have different lengths

Hi there,

I read your paper, and you mention that you use padding to ensure the input and output of the convolutional layer have the same length. However, in the code you set padding=int(kernel_size/2), which does not guarantee this. For example, if the input is 111 and kernel_size is 4, the padded input is 0011100 and the output has length 4, not 3.

I googled and found that PyTorch does not seem to have an equivalent of TensorFlow's 'same' padding for convolutional layers. Is there any workaround to achieve this?
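For reference, a minimal sketch of one possible workaround (not from this repo): pad asymmetrically with F.pad before the convolution so that the output length matches the input length even for even kernel sizes.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SameConv1d(nn.Module):
    # Conv1d that preserves the sequence length (stride 1, dilation 1).
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)
        total = kernel_size - 1                       # total padding needed
        self.left, self.right = total // 2, total - total // 2

    def forward(self, x):                             # x: (batch, channels, length)
        x = F.pad(x, (self.left, self.right))         # asymmetric 'same' padding
        return self.conv(x)

# e.g. an input of length 3 stays length 3 even with kernel_size 4
x = torch.randn(1, 100, 3)
print(SameConv1d(100, 50, 4)(x).shape)                # torch.Size([1, 50, 3])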

Thanks a lot.

Cannot reproduce

Hi,

Running predictions/DRCAML_mimic3_50/train_new_model.sh, I got the following result:

evaluating on test
file for evaluation: ../../mimicdata/mimic3/test_50.csv

[MACRO] accuracy: 0.363, precision: 0.557, recall: 0.465, f-measure: 0.501, AUC: 0.855
[MICRO] accuracy: 0.389, precision: 0.619, recall: 0.511, f-measure: 0.560, AUC: 0.881

prec_at_5: 0.553
rec_at_5: 0.523

The DR-CAML performance I got above is much worse than the one reported in Table 5.
I cannot reproduce the CAML result either, while the CNN works about as well as in Table 5.
Could you release the hyperparameters or updated scripts that reproduce the performance reported in the paper?

Thank you very much!

Question about the last logistic layer.

Hi James,

I'm trying to understand the code in your model.py.

I see that at line 106 you have self.final = nn.Linear(num_filter_maps, Y), where Y is about 8930 (the size of the label space).

Please correct me if I am wrong. As I understand it, this nn.Linear is a trick: you keep its weight matrix so that you can do the pointwise multiplication at line 135, y = self.final.weight.mul(m).sum(dim=2).add(self.final.bias).

nn.Linear is not used in the standard way shown in the PyTorch tutorials, where we would usually call self.final(some_input).
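For reference, a minimal sketch of the contrast being described (shapes are illustrative, not taken from the repo):

import torch
import torch.nn as nn

Y, d = 8930, 50                        # illustrative sizes: labels, filter maps
final = nn.Linear(d, Y)

# Standard usage: one feature vector per example.
v = torch.randn(16, d)                 # (batch, d)
out_standard = final(v)                # (batch, Y)

# Per-label usage described above: m holds one d-dim vector per label,
# so final.weight is used row by row instead of as a single matmul.
m = torch.randn(16, Y, d)              # (batch, Y, d)
out_per_label = final.weight.mul(m).sum(dim=2).add(final.bias)   # (batch, Y)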

Is this correct?

Training times

Hi,
Could you report the training time as well as your hardware specifications?

Statistics such as time per batch (for different batch sizes) and time per epoch would be good to have, since the sequences run to 2,500 words!

Error in loading the trained model

Hi,

I trained a model based on the code here. When I load the model back into Python, I get an error.

This is how I call the command

training.py train_full.csv vocab.csv full conv_attn 100 --filter-size 10 --num-filter-maps 50 --dropout 0.2 --patience 10 --lr 0.0001 --test-model model_best_prec_at_8.pth --gpu --quiet

This is the error I get,

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 482, in load_state_dict
    own_state[name].copy_(param)
RuntimeError: inconsistent tensor size, expected tensor [51919 x 100] and src [51920 x 100] to have the same number of elements, but got 5191900 and 5192000 elements respectively at c:\pytorch\torch\lib\th\generic/THTensorCopy.c:86

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "training.py", line 355, in <module>
    main(args)
  File "training.py", line 31, in main
    args, model, optimizer, params, dicts = init(args)
  File "training.py", line 48, in init
    model = tools.pick_model(args, dicts)
  File "C:/Users/dat/Dropbox/caml-mimic\learn\tools.py", line 36, in pick_model
    model.load_state_dict(sd)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 487, in load_state_dict
    .format(name, own_state[name].size(), param.size()))
RuntimeError: While copying the parameter named embed.weight, whose dimensions in the model are torch.Size([51919, 100]) and whose dimensions in the checkpoint are torch.Size([51920, 100]).

It seems that vocab.csv does not match the vocabulary in the trained model. Is this because the "unknown" token was added after vocab.csv was created?

I feel there is some strange mismatch here. I checked vocab.csv, and it has 51917 lines.

wc -l vocab.csv
51917 vocab.csv
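As a debugging aid, here is a small sketch (paths are hypothetical, and it assumes the checkpoint is a plain state_dict, as the error message suggests) that compares the number of vocabulary lines against the embedding matrix stored in the checkpoint:

import torch

# Hypothetical paths; adjust to your setup.
sd = torch.load("model_best_prec_at_8.pth", map_location="cpu")
with open("vocab.csv") as f:
    vocab_size = sum(1 for _ in f)

# The embedding typically has extra rows for padding/unknown tokens,
# so the two numbers are expected to differ by a small constant.
print("vocab lines:", vocab_size)
print("embed.weight:", sd["embed.weight"].shape)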

Thanks.

Vector representation of the ICD9 description

Hi, from reading through model.py, I see that the vector embeddings of the label descriptions are trained jointly with the rest of the model. So, "practically speaking", the 2nd module in section 2.5 of the paper is in fact trained jointly with the 1st module. Is this correct?

To make the question clearer: when you say "2nd module" in the paper, you do not mean that it is trained entirely independently of the "standard" model. Is this correct?

Thanks for your help.

ICD9 50 codes

I am wondering if you could share the top 50 ICD codes used in your work. Did you use DIAGNOSES_ICD to extract the ICD codes? I looked into CAML_mimic3_50/preds_test.psv and found some codes, such as 37.22 and 96.72, that I cannot find in the MIMIC-III data.
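For context, a sketch of how the 50 most frequent codes could be counted if both the diagnosis and procedure tables are combined (that combination is an assumption on my part; the dataproc notebook is the authoritative source):

import pandas as pd
from collections import Counter

# MIMIC-III tables; both have an ICD9_CODE column.
diag = pd.read_csv("DIAGNOSES_ICD.csv", dtype=str)
proc = pd.read_csv("PROCEDURES_ICD.csv", dtype=str)

counts = Counter(diag["ICD9_CODE"].dropna())
counts.update(proc["ICD9_CODE"].dropna())
top50 = [code for code, _ in counts.most_common(50)]
print(top50[:10])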

Parameter Y (size of label space) is overwritten

This is just a minor issue but might be helpful for other users.
The parameter Y of the training script is not used for building the model. Instead, the size of the label space is set in the method 'pick_model' from the dictionary computed during data processing.
This makes sense but then the parameter Y should be removed.

Handling very large text field

Hi,

I am having a problem with line 40 in training.py:

csv.field_size_limit(sys.maxsize)

The error says OverflowError: Python int too large to convert to C long.

What do you think is causing this problem? Is it the total number of words (not unique words, but the total count)?
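For what it is worth, sys.maxsize does not fit in a C long on Windows builds of Python, which is one common cause of this exact error. A frequently used workaround (a sketch, not the repo's code) is to back off to the largest limit the platform accepts:

import csv
import sys

# Reduce the limit until the platform accepts it.
max_int = sys.maxsize
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except OverflowError:
        max_int //= 2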

Can I ask what was your memory usage?

Thanks.

No such file or directory: '../mimicdata/mimic3/train_full_hadm_ids.csv'

Hi @jamesmullenbach,

I'm getting an error while running the dataproc_mimic_III notebook:

dataproc/concat_and_split.pyc in split_data(labeledfile, base_name)
     61     for splt in ['train', 'dev', 'test']:
     62         hadm_ids[splt] = set()
---> 63         with open('%s/%s_full_hadm_ids.csv' % (MIMIC_3_DIR, splt), 'r') as f:
     64             for line in f:
     65                 hadm_ids[splt].add(line.rstrip())
IOError: [Errno 2] No such file or directory: '../mimicdata/mimic3/train_full_hadm_ids.csv'

The README.md states that these files are already in the repository:

|   |   *_hadm_ids.csv (already in repo)

However, it looks like they are not. Where can these files be found? Am I missing something?

Issue with train_full, test_full, dev_full files

I prepared the data following the dataproc_mimic_III.ipynb notebook and got six files: train_50, test_50, dev_50, train_full, test_full, and dev_full. I am having a problem with train_full, test_full, and dev_full: train_full contains 8686 unique labels, test_full contains 4075, and dev_full contains 3009. I don't understand why the label sets are not the same size in each file, or how to make them consistent so that I can train my model.
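For reference, a small counting sketch (assuming the generated CSVs have a semicolon-separated LABELS column, as the notebook produces):

import pandas as pd

# Assumption: each split file has a LABELS column of ';'-separated ICD-9 codes.
for split in ["train_full", "dev_full", "test_full"]:
    df = pd.read_csv("../mimicdata/mimic3/%s.csv" % split, dtype=str)
    labels = set()
    for row in df["LABELS"].dropna():
        labels.update(row.split(";"))
    print(split, len(labels))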

Kindly help me.

Error in concat_and_split.py function split_data

Everything works fine in the mimic3 notebook until:
tr, dv, te = concat_and_split.split_data(fname, base_name=base_name)

notes_labeled.csv
disch_full.csv

are OK and generated successfully, but hadm_id = row[1] fails; it looks like there is an empty row somewhere around the header, no?

SPLITTING
0 read

IndexError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 tr, dv, te = concat_and_split.split_data(fname, base_name=base_name)

~\Documents\GitHub\caml-mimic\dataproc\concat_and_split.py in split_data(labeledfile, base_name)
75 print(str(i) + " read")
76
---> 77 hadm_id = row[1]
78
79 if hadm_id in hadm_ids['train']:

IndexError: list index out of range
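If the cause is indeed a blank or malformed row, a defensive tweak (a sketch, not the repo's code) would be to skip rows that are too short before indexing into them:

import csv

with open("notes_labeled.csv") as f:
    reader = csv.reader(f)
    next(reader)                       # skip the header row
    for row in reader:
        if len(row) < 2:               # empty or malformed line
            continue
        hadm_id = row[1]
        # ... route the row to train/dev/test as split_data does ...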

Sigmoid before classification

Hi,

If I understand correctly, models.py > ConvAttnPool is the model corresponding to the CAML architecture proposed in the paper.
Looking at the forward function, I see that the last operation before the loss is computed is linear (multiplying by final.weight and adding final.bias):

y = self.final.weight.mul(m).sum(dim=2).add(self.final.bias)

but there is no sigmoid after that, even though the paper applies one:
[image: the paper's equation, which passes the final score through a sigmoid]
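For context, one common pattern is to leave the sigmoid out of forward and let the loss apply it internally; a sketch of that pattern (shapes illustrative, not necessarily what this repo does):

import torch
import torch.nn.functional as F

logits = torch.randn(16, 8930)                  # raw scores from the final linear layer
targets = torch.randint(0, 2, (16, 8930)).float()

# The sigmoid is folded into the loss for numerical stability...
loss = F.binary_cross_entropy_with_logits(logits, targets)

# ...and applied explicitly only when probabilities are needed.
probs = torch.sigmoid(logits)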

What did I miss?

Thanks :-)
Mor

A question about calculating precision@k

Hello,

I have a question about the function precision_at_k in evaluation.py. I think the denominator should be the number of positive (1) predictions made among the top k predictions; however, in the code, the length of the top k is used. For example, if only 1 positive prediction falls in the top 5, the denominator should be 1, but in this case it would still be 5.

Here is my modification:

import numpy as np

def precision_at_k(yhat, yhat_raw, y, k):
    #num true labels in top k predictions / num positive (1) predictions in top k
    sortd = np.argsort(yhat_raw)[:,::-1]
    topk = sortd[:,:k]

    #get precision at k for each example
    vals = []
    for i, tk in enumerate(topk):
        if len(tk) > 0:
            num_true_in_top_k = y[i,tk].sum()
            denom = yhat[i,tk].sum()
            if denom == 0: # in case no true predictions made in top k
                vals.append(1)
            else:
                vals.append(num_true_in_top_k / float(denom))

    return np.mean(vals)
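A tiny worked example (values invented) to show how the two denominators differ:

import numpy as np

y        = np.array([[1, 0, 0, 0, 0]])              # gold labels
yhat     = np.array([[1, 1, 0, 0, 0]])              # thresholded predictions
yhat_raw = np.array([[0.9, 0.8, 0.3, 0.2, 0.1]])    # raw scores

# With the repo's denominator len(tk): 1 true label in the top 5 / 5 = 0.2
# With the proposed denominator:       1 true label / 2 positive predictions = 0.5
print(precision_at_k(yhat, yhat_raw, y, k=5))        # 0.5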

Could you take a look at it? Correct me if I am wrong.

padding, softmax, embeddings

Hi,

I have two questions regarding the CAML implementation:

  1. All the texts in a batch are padded, but the input to the softmax is not masked. Hence, this implementation also assigns positive attention weights to padding tokens, right? Am I missing something here? (See the masking sketch after this list.)
  2. The embedding vector for the padding token does not seem to be fixed to the zero vector. If not, where is that constraint implemented? (I guess it would not make a difference if point 1 were handled differently, i.e. if the attention weights for padding positions were fixed to 0.)
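A minimal sketch of masking padded positions before the softmax (illustrative, not the repo's implementation):

import torch
import torch.nn.functional as F

scores = torch.randn(2, 8930, 6)                     # (batch, labels, seq_len) attention scores
pad_mask = torch.tensor([[0, 0, 0, 0, 1, 1],         # 1 marks a padding position
                         [0, 0, 0, 1, 1, 1]]).bool()

# Set padded positions to -inf so the softmax assigns them zero weight.
scores = scores.masked_fill(pad_mask[:, None, :], float("-inf"))
alpha = F.softmax(scores, dim=2)                     # padding positions get attention 0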

Many thanks!

Understanding the prediction psv file

Hi James,

I am trying to understand the prediction psv file. In preds_dev.psv, why does each row have a different length? Shouldn't each row have the same number of predicted ICD-9 codes (at least 15, because of the Precision@8 and Precision@15 metrics)?

Thanks.
