
medal's People

Contributors

brucewen120, dependabot[bot], xhluca, xhlulu


medal's Issues

Unable to recreate results

I trained the ELECTRA model following the instructions on the GitHub page.
I also used the same train-test-val split as mentioned.
Training ran on 2 GPUs (2080 Tis, multi-GPU setting) for 10 epochs.

  1. The 10 epochs of training took 5 full days. Is that normal? Can it be made faster?

  2. There is no mention of how many epochs the model was trained for to obtain the results reported in the paper.

  3. The validation accuracy is very low after training for 10 epochs:

    End of epoch 9
    Train Loss: 9.5633  Train Accuracy: 0.0004
    Valid Loss: 9.5592  Valid Accuracy: 0.0003

It would be of great help if someone could assist with the above issues as soon as possible.

Doubt regarding pre-training of ELECTRA

Hi Bruce,

I liked the idea behind your MeDAL dataset and pre-training concept using abbreviations.

While going through the code, I could not understand the following:

  1. How do you make the model predict the full form of an abbreviation in a sentence?

  2. Also, could you please explain what the block of code below does?

abbs = torch.stack([sents[n, idx, :] for n, idx in enumerate(locs)]) # (B * M)
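In case it helps others reading this issue, here is my (hedged) reading of that line: sents is a batch of per-token hidden states of shape (B, S, H), locs holds the abbreviation's token index for each example, and the comprehension picks out the hidden state at that index for each example, yielding a (B, H) tensor. A torch-free toy emulation with nested lists:

```python
# Toy emulation (plain lists instead of tensors) of:
#   abbs = torch.stack([sents[n, idx, :] for n, idx in enumerate(locs)])
# sents: shape (B=2, S=3, H=2) -- hidden states for each token of each example
sents = [
    [[1, 1], [2, 2], [3, 3]],   # example 0
    [[4, 4], [5, 5], [6, 6]],   # example 1
]
locs = [2, 0]  # abbreviation at token 2 in example 0, token 0 in example 1

# For each example n, take the hidden state of the token at locs[n]
abbs = [sents[n][idx] for n, idx in enumerate(locs)]
print(abbs)  # [[3, 3], [4, 4]]
```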

Question regarding abbreviations and preprocessing

Hello BruceWen!

Thanks for your work and for making the MeDAL dataset available!

I loaded the dataset, looked into the most frequently occurring abbreviations in the first 50,000 rows, and was wondering why the abbreviations all seem rather cryptic:

[screenshot]

It seems that occurrences of "after" were substituted with "T3" in the articles through the reverse-substitution process, but I don't understand where the abbreviation "T3" stems from.

[screenshot]

I found the original publication of the first article in the dataset with the "T3" abbreviation, and it does not mention "T3" anywhere, just "after", which was then replaced by your procedure.

Publication: https://pubmed.ncbi.nlm.nih.gov/22/
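For what it's worth, here is my current understanding of the reverse-substitution step, as a hedged sketch; the abbrev_map entry below is hypothetical, chosen only to reproduce the "after"/"T3" case I observed, and the real mapping is exactly what I'm asking about:

```python
# Hypothetical sketch of reverse substitution as I understand it:
# each occurrence of a known expansion is replaced by one of its
# abbreviations, recording the token index and the true expansion as the label.
abbrev_map = {"after": "T3"}  # hypothetical entry; real mapping not released
tokens = "improvement was seen after treatment".split()

locations, labels = [], []
for i, word in enumerate(tokens):
    if word in abbrev_map:
        labels.append(word)           # LABEL column: the true expansion
        locations.append(i)           # LOCATION column: token index
        tokens[i] = abbrev_map[word]  # substitute the abbreviation in TEXT
print(" ".join(tokens), locations, labels)
# improvement was seen T3 treatment [3] ['after']
```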

I also noticed that the text in the MeDAL dataset does not contain any punctuation. Why was this removed? The part of your publication that presents the creation of the dataset does not go into detail here.

You also mention modifying abbreviations from prior research, ending up with 5,886 abbreviations in total with corresponding expansions. Is this dataframe/dictionary also available for download?

This is the code I used to reconstruct such an abbreviation-expansion dictionary from the MeDAL dataset:

from tqdm.auto import tqdm
import pandas as pd

medal_df = pd.read_csv(path_to_full_data_csv)
medal_df.head()

loop_df = medal_df[:50000]
overall_article_abbreviations_df = pd.DataFrame()

for index, row in tqdm(loop_df.iterrows(), total=len(loop_df)):
    # Token positions of the abbreviations and their true expansions
    start_positions = [int(x) for x in row["LOCATION"].split("|")]
    resolutions = row["LABEL"].split("|")
    tokens = row["TEXT"].split(" ")
    abbreviations = [tokens[pos] for pos in start_positions]

    this_article_abbreviations_df = pd.DataFrame(
        {"Abbreviations": abbreviations, "Resolutions": resolutions}
    )
    overall_article_abbreviations_df = pd.concat(
        [overall_article_abbreviations_df, this_article_abbreviations_df]
    )

abbreviation_df = (
    overall_article_abbreviations_df
    .groupby(["Abbreviations", "Resolutions"])
    .size()
    .sort_values(ascending=False)
    .reset_index(name="count")
)
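On a made-up two-row frame mimicking the MeDAL columns, the counting logic above behaves as I'd expect (data invented for illustration):

```python
import pandas as pd

# Made-up frame with the same columns as the real CSV
toy = pd.DataFrame({
    "TEXT": ["pt had T3 surgery", "T3 the MRI scan"],
    "LOCATION": ["2", "0"],
    "LABEL": ["after", "after"],
})

rows = []
for _, row in toy.iterrows():
    positions = [int(x) for x in row["LOCATION"].split("|")]
    resolutions = row["LABEL"].split("|")
    tokens = row["TEXT"].split(" ")
    for pos, res in zip(positions, resolutions):
        rows.append({"Abbreviations": tokens[pos], "Resolutions": res})

counts = (
    pd.DataFrame(rows)
    .groupby(["Abbreviations", "Resolutions"])
    .size()
    .reset_index(name="count")
)
print(counts)  # single pair: Abbreviations="T3", Resolutions="after", count=2
```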

I inspected the most and least frequently occurring abbreviations, and I suspect that some error happened during the reverse-substitution process for most of the frequently occurring ones:

[screenshot]

Hopefully you can help me understand these things a little bit better :)

Doubts in Expansion of Abbreviation in Coding

Hello BruceWen!

I'm looking exclusively for abbreviation expansion in sentences, and I came across your GitHub repo. I have a few questions about your work.

  1. I started using the pre-trained models one by one, starting with LSTM (lstm = torch.hub.load("BruceWen120/medal", "lstm")).
    I want to disambiguate the abbreviations in the top 10 sentences of the test data downloaded from Kaggle, and later use my own data. When I converted the test data to (batch=10, sequence=230, dim=300) format using fastText with crawl-300d-2M-subword.bin and fed it to the model, I got a 22,555-dimensional output (as documented) for every data point. Can you tell me what this output refers to? What is the next step after getting these 22,555-dimensional outputs, and where and how do I get the expansions from them?

  2. This tool disambiguates only a single abbreviation in a sentence, not multiple, since it is location-based, so it is mandatory to have locations in the test data. Can you tell me how you detected and extracted the locations of abbreviations in a given text? I have to do this on my own data as well.

Looking forward to your reply.
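In case anyone lands here with the same question: my current assumption is that the 22,555-dimensional output is a score per candidate expansion, so the prediction is an argmax into a label vocabulary. The id_to_expansion list below is hypothetical and would have to be built from the dataset's LABEL vocabulary; a toy 3-class version stands in for the real 22,555 classes:

```python
# Hypothetical sketch: turning model logits into an expansion string.
# `id_to_expansion` would be built from the dataset's LABEL vocabulary
# (this is exactly the artifact I'm asking about above).
logits = [0.1, 2.5, -0.3]  # toy 3-class scores instead of 22,555
id_to_expansion = ["after", "thyroxine", "triiodothyronine"]

pred = max(range(len(logits)), key=lambda i: logits[i])  # argmax index
print(id_to_expansion[pred])  # thyroxine
```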
