
caml-mimic's People

Contributors

jamesmullenbach, sarahwie


caml-mimic's Issues

Input and output of convolutional layer have different lengths

Hi there,

I read your paper, and you mention that you use padding to ensure the input and output of the convolutional layer have the same length. However, in the code you set padding=int(kernel_size/2), which does not guarantee this. For example, if the input is 111 and kernel_size is 4, the padded input is 0011100 and the output has length 4, not 3.

I googled and found that PyTorch does not seem to have an equivalent of TensorFlow's 'same' padding for convolutional layers. Is there any workaround to achieve this?
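For reference, a minimal sketch of one possible workaround (not from this repo): pad asymmetrically with F.pad before the convolution so that the output length matches the input length even for even kernel sizes.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SameConv1d(nn.Module):
    # Conv1d that preserves the sequence length (stride 1, dilation 1).
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)
        total = kernel_size - 1                       # total padding needed
        self.left, self.right = total // 2, total - total // 2

    def forward(self, x):                             # x: (batch, channels, length)
        x = F.pad(x, (self.left, self.right))         # asymmetric 'same' padding
        return self.conv(x)

# e.g. an input of length 3 stays length 3 even with kernel_size 4
x = torch.randn(1, 100, 3)
print(SameConv1d(100, 50, 4)(x).shape)                # torch.Size([1, 50, 3])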

Thanks a lot.

Cannot reproduce

Hi,

Running predictions/DRCAML_mimic3_50/train_new_model.sh, I got the following result:

evaluating on test
file for evaluation: ../../mimicdata/mimic3/test_50.csv

[MACRO] accuracy: 0.363, precision: 0.557, recall: 0.465, f-measure: 0.501, AUC: 0.855
[MICRO] accuracy: 0.389, precision: 0.619, recall: 0.511, f-measure: 0.560, AUC: 0.881

prec_at_5: 0.553
rec_at_5: 0.523

The DR-CAML performance I got above is much worse than the one reported in Table 5.
I cannot reproduce the CAML result either, while the CNN works about as well as in Table 5.
Could you release the hyperparameters or updated scripts that reproduce the performance reported in the paper?

Thank you very much!

Question about the last logistic layer.

Hi James,

I'm trying to understand the code in your model.py.

I see that at line 106 you have self.final = nn.Linear(num_filter_maps, Y), where Y is about 8930 (the size of the label space).

Please correct me if I am wrong. As I understand it, this nn.Linear is a trick: you keep its weight matrix so that you can do the pointwise multiplication at line 135, y = self.final.weight.mul(m).sum(dim=2).add(self.final.bias).

nn.Linear is not used in the standard way shown in the PyTorch tutorials, where we would usually call self.final(some_input).
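For reference, a minimal sketch of the contrast being described (shapes are illustrative, not taken from the repo):

import torch
import torch.nn as nn

Y, d = 8930, 50                        # illustrative sizes: labels, filter maps
final = nn.Linear(d, Y)

# Standard usage: one feature vector per example.
v = torch.randn(16, d)                 # (batch, d)
out_standard = final(v)                # (batch, Y)

# Per-label usage described above: m holds one d-dim vector per label,
# so final.weight is used row by row instead of as a single matmul.
m = torch.randn(16, Y, d)              # (batch, Y, d)
out_per_label = final.weight.mul(m).sum(dim=2).add(final.bias)   # (batch, Y)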

Is this correct?

Training times

Hi,
Could you report the training time as well as your hardware specifications?

Statistics such as time per batch (for different batch sizes) and time per epoch would be good to have, since the sequences run to 2,500 words!

Error in loading the trained model

Hi,

I trained a model based on the code here. When I load the model back into Python, I get an error.

This is how I call the command

training.py train_full.csv vocab.csv full conv_attn 100 --filter-size 10 --num-filter-maps 50 --dropout 0.2 --patience 10 --lr 0.0001 --test-model model_best_prec_at_8.pth --gpu --quiet

This is the error I get,

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 482, in load_state_dict
    own_state[name].copy_(param)
RuntimeError: inconsistent tensor size, expected tensor [51919 x 100] and src [51920 x 100] to have the same number of elements, but got 5191900 and 5192000 elements respectively at c:\pytorch\torch\lib\th\generic/THTensorCopy.c:86

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "training.py", line 355, in <module>
    main(args)
  File "training.py", line 31, in main
    args, model, optimizer, params, dicts = init(args)
  File "training.py", line 48, in init
    model = tools.pick_model(args, dicts)
  File "C:/Users/dat/Dropbox/caml-mimic\learn\tools.py", line 36, in pick_model
    model.load_state_dict(sd)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 487, in load_state_dict
    .format(name, own_state[name].size(), param.size()))
RuntimeError: While copying the parameter named embed.weight, whose dimensions in the model are torch.Size([51919, 100]) and whose dimensions in the checkpoint are torch.Size([51920, 100]).

It seems that vocab.csv does not match the vocabulary in the trained model. Is this because the "unknown" token was added after vocab.csv was created?

I feel there is some strange mismatch here. I checked vocab.csv, and it has 51917 lines.

wc -l vocab.csv
51917 vocab.csv
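As a debugging aid, here is a small sketch (paths are hypothetical, and it assumes the checkpoint is a plain state_dict, as the error message suggests) that compares the number of vocabulary lines against the embedding matrix stored in the checkpoint:

import torch

# Hypothetical paths; adjust to your setup.
sd = torch.load("model_best_prec_at_8.pth", map_location="cpu")
with open("vocab.csv") as f:
    vocab_size = sum(1 for _ in f)

# The embedding typically has extra rows for padding/unknown tokens,
# so the two numbers are expected to differ by a small constant.
print("vocab lines:", vocab_size)
print("embed.weight:", sd["embed.weight"].shape)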

Thanks.

Vector representation of the ICD9 description

Hi, from reading through model.py, I see that the vector embeddings of the label descriptions are trained jointly with the rest of the model. So, "practically speaking", the 2nd module in section 2.5 of the paper is in fact trained jointly with the 1st module. Is this correct?

To make the question clearer: when you say "2nd module" in the paper, you do not mean that it is trained entirely independently of the "standard" model. Is this correct?

Thanks for your help.

ICD9 50 codes

I am wondering if you could share the top 50 ICD codes used in your work. Did you use DIAGNOSES_ICD to extract the ICD codes? I looked into CAML_mimic3_50/preds_test.psv and found some codes, such as 37.22 and 96.72, that I cannot find in the MIMIC-III data.
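For context, a sketch of how the 50 most frequent codes could be counted if both the diagnosis and procedure tables are combined (that combination is an assumption on my part; the dataproc notebook is the authoritative source):

import pandas as pd
from collections import Counter

# MIMIC-III tables; both have an ICD9_CODE column.
diag = pd.read_csv("DIAGNOSES_ICD.csv", dtype=str)
proc = pd.read_csv("PROCEDURES_ICD.csv", dtype=str)

counts = Counter(diag["ICD9_CODE"].dropna())
counts.update(proc["ICD9_CODE"].dropna())
top50 = [code for code, _ in counts.most_common(50)]
print(top50[:10])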

Parameter Y (size of label space) is overwritten

This is just a minor issue but might be helpful for other users.
The parameter Y of the training script is not used for building the model. Instead, the size of the label space is set in the method 'pick_model' from the dictionary computed during data processing.
This makes sense but then the parameter Y should be removed.

Handling very large text field

Hi,

I am having a problem with line 40 in training.py:

csv.field_size_limit(sys.maxsize)

The error says OverflowError: Python int too large to convert to C long.

What do you think is causing this problem? Is it the total number of words (not unique words, but the total count)?
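For what it is worth, sys.maxsize does not fit in a C long on Windows builds of Python, which is one common cause of this exact error. A frequently used workaround (a sketch, not the repo's code) is to back off to the largest limit the platform accepts:

import csv
import sys

# Reduce the limit until the platform accepts it.
max_int = sys.maxsize
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except OverflowError:
        max_int //= 2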

Can I ask what was your memory usage?

Thanks.

No such file or directory: '../mimicdata/mimic3/train_full_hadm_ids.csv'

Hi @jamesmullenbach,

I'm getting an error while running the dataproc_mimic_III notebook:

dataproc/concat_and_split.pyc in split_data(labeledfile, base_name)
     61     for splt in ['train', 'dev', 'test']:
     62         hadm_ids[splt] = set()
---> 63         with open('%s/%s_full_hadm_ids.csv' % (MIMIC_3_DIR, splt), 'r') as f:
     64             for line in f:
     65                 hadm_ids[splt].add(line.rstrip())
IOError: [Errno 2] No such file or directory: '../mimicdata/mimic3/train_full_hadm_ids.csv'

The README.md states that these files are already in the repository:

|   |   *_hadm_ids.csv (already in repo)

However, it looks like they are not. Where can these files be found? Am I missing something?

Issue with train_full, test_full, dev_full files

I prepared the data following the dataproc_mimic_III.ipynb notebook and got six files: train_50, test_50, dev_50, train_full, test_full, and dev_full. I am having a problem with train_full, test_full, and dev_full: train_full contains 8686 unique labels, test_full contains 4075, and dev_full contains 3009. I don't understand why the label sets are not the same size in each file, or how to make them consistent so that I can train my model.
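For reference, a small counting sketch (assuming the generated CSVs have a semicolon-separated LABELS column, as the notebook produces):

import pandas as pd

# Assumption: each split file has a LABELS column of ';'-separated ICD-9 codes.
for split in ["train_full", "dev_full", "test_full"]:
    df = pd.read_csv("../mimicdata/mimic3/%s.csv" % split, dtype=str)
    labels = set()
    for row in df["LABELS"].dropna():
        labels.update(row.split(";"))
    print(split, len(labels))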

Kindly help me.

Error in concat_and_split.py function split_data

Everything works fine in the mimic3 notebook until:
tr, dv, te = concat_and_split.split_data(fname, base_name=base_name)

notes_labeled.csv
disch_full.csv

are OK and generated successfully, but hadm_id = row[1] fails; it looks like there is an empty row somewhere around the header, no?

SPLITTING
0 read

IndexError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 tr, dv, te = concat_and_split.split_data(fname, base_name=base_name)

~\Documents\GitHub\caml-mimic\dataproc\concat_and_split.py in split_data(labeledfile, base_name)
75 print(str(i) + " read")
76
---> 77 hadm_id = row[1]
78
79 if hadm_id in hadm_ids['train']:

IndexError: list index out of range
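If the cause is indeed a blank or malformed row, a defensive tweak (a sketch, not the repo's code) would be to skip rows that are too short before indexing into them:

import csv

with open("notes_labeled.csv") as f:
    reader = csv.reader(f)
    next(reader)                       # skip the header row
    for row in reader:
        if len(row) < 2:               # empty or malformed line
            continue
        hadm_id = row[1]
        # ... route the row to train/dev/test as split_data does ...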

Sigmoid before classification

Hi,

If I understand correctly, models.py > ConvAttnPool is the model corresponding to the CAML architecture proposed in the paper.
Looking at the forward function, I see that the last operation before the loss is computed is linear (multiplying by final.weight and adding final.bias):

y = self.final.weight.mul(m).sum(dim=2).add(self.final.bias)

but there is no sigmoid after that, even though the paper applies one:
[image: the paper's equation, which passes the final score through a sigmoid]
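For context, one common pattern is to leave the sigmoid out of forward and let the loss apply it internally; a sketch of that pattern (shapes illustrative, not necessarily what this repo does):

import torch
import torch.nn.functional as F

logits = torch.randn(16, 8930)                  # raw scores from the final linear layer
targets = torch.randint(0, 2, (16, 8930)).float()

# The sigmoid is folded into the loss for numerical stability...
loss = F.binary_cross_entropy_with_logits(logits, targets)

# ...and applied explicitly only when probabilities are needed.
probs = torch.sigmoid(logits)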

What did I miss?

Thanks :-)
Mor

A question about calculating precision@k

Hello,

I have a question about the function precision_at_k in evaluation.py. I think the denominator should be the number of positive (1) predictions made among the top k predictions; however, in the code, the length of the top k is used. For example, if only 1 positive prediction falls in the top 5, the denominator should be 1, but in this case it would still be 5.

Here is my modification:

import numpy as np

def precision_at_k(yhat, yhat_raw, y, k):
    #num true labels in top k predictions / num positive (1) predictions in top k
    sortd = np.argsort(yhat_raw)[:,::-1]
    topk = sortd[:,:k]

    #get precision at k for each example
    vals = []
    for i, tk in enumerate(topk):
        if len(tk) > 0:
            num_true_in_top_k = y[i,tk].sum()
            denom = yhat[i,tk].sum()
            if denom == 0: # in case no true predictions made in top k
                vals.append(1)
            else:
                vals.append(num_true_in_top_k / float(denom))

    return np.mean(vals)
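A tiny worked example (values invented) to show how the two denominators differ:

import numpy as np

y        = np.array([[1, 0, 0, 0, 0]])              # gold labels
yhat     = np.array([[1, 1, 0, 0, 0]])              # thresholded predictions
yhat_raw = np.array([[0.9, 0.8, 0.3, 0.2, 0.1]])    # raw scores

# With the repo's denominator len(tk): 1 true label in the top 5 / 5 = 0.2
# With the proposed denominator:       1 true label / 2 positive predictions = 0.5
print(precision_at_k(yhat, yhat_raw, y, k=5))        # 0.5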

Could you take a look at it? Correct me if I am wrong.

padding, softmax, embeddings

Hi,

I have two questions regarding the CAML implementation:

  1. All the texts in a batch are padded, but the input to the softmax is not masked. Hence, this implementation also assigns positive attention weights to padding tokens, right? Am I missing something here? (See the masking sketch after this list.)
  2. The embedding vector for the padding token does not seem to be fixed to the zero vector. If not, where is that constraint implemented? (I guess it would not make a difference if point 1 were handled differently, i.e. if the attention weights for padding positions were fixed to 0.)
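A minimal sketch of masking padded positions before the softmax (illustrative, not the repo's implementation):

import torch
import torch.nn.functional as F

scores = torch.randn(2, 8930, 6)                     # (batch, labels, seq_len) attention scores
pad_mask = torch.tensor([[0, 0, 0, 0, 1, 1],         # 1 marks a padding position
                         [0, 0, 0, 1, 1, 1]]).bool()

# Set padded positions to -inf so the softmax assigns them zero weight.
scores = scores.masked_fill(pad_mask[:, None, :], float("-inf"))
alpha = F.softmax(scores, dim=2)                     # padding positions get attention 0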

Many thanks!

Understanding the prediction psv file

Hi James,

I am trying to understand the prediction psv file. In preds_dev.psv, why does each row have a different length? Shouldn't each row have the same number of predicted ICD-9 codes (at least 15, because of the Precision@8 and Precision@15 metrics)?

Thanks.
