
medal's People

Contributors

brucewen120, dependabot[bot], xhluca, xhlulu


medal's Issues

Unable to recreate results

I trained the ELECTRA model following the instructions on the GitHub page.
I also used the same train-test-val split as mentioned.
Training ran on 2 GPUs (2080 Tis, multi-GPU setting) for 10 epochs.

  1. The 10 epochs of training took 5 full days. Is that normal? Can it be made faster?

  2. There is no mention of how many epochs the model was trained for to obtain the results reported in the paper.

  3. The validation accuracy is very low after training for 10 epochs:

    End of epoch 9
    Train Loss: 9.5633  Train Accuracy: 0.0004
    Valid Loss: 9.5592  Valid Accuracy: 0.0003

It would be of great help if someone could assist with the above issues as soon as possible.

Doubt regarding pre-training of ELECTRA

Hi Bruce,

I liked the idea behind your MeDAL dataset and pre-training concept using abbreviations.

While going through the code, I could not understand the following:

  1. How do you make the model predict the full form of an abbreviation in a sentence?

  2. Also, could you please explain what the block of code below does?

abbs = torch.stack([sents[n, idx, :] for n, idx in enumerate(locs)]) # (B * M)
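In case it helps others reading this issue, here is my (hedged) reading of that line: sents is a batch of per-token hidden states of shape (B, S, H), locs holds the abbreviation's token index for each example, and the comprehension picks out the hidden state at that index for each example, yielding a (B, H) tensor. A torch-free toy emulation with nested lists:

```python
# Toy emulation (plain lists instead of tensors) of:
#   abbs = torch.stack([sents[n, idx, :] for n, idx in enumerate(locs)])
# sents: shape (B=2, S=3, H=2) -- hidden states for each token of each example
sents = [
    [[1, 1], [2, 2], [3, 3]],   # example 0
    [[4, 4], [5, 5], [6, 6]],   # example 1
]
locs = [2, 0]  # abbreviation at token 2 in example 0, token 0 in example 1

# For each example n, take the hidden state of the token at locs[n]
abbs = [sents[n][idx] for n, idx in enumerate(locs)]
print(abbs)  # [[3, 3], [4, 4]]
```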

Question regarding abbreviations and preprocessing

Hello BruceWen!

Thanks for your work and for making the MeDAL dataset available!

I loaded the dataset, looked into the most frequently occurring abbreviations in the first 50,000 rows, and was wondering why the abbreviations all seem rather cryptic:

[screenshot]

It seems that occurrences of "after" were substituted with "T3" in the articles through the reverse-substitution process, but I don't understand where the abbreviation "T3" stems from.

[screenshot]

I found the original publication of the first article in the dataset with the "T3" abbreviation, and it does not mention "T3" anywhere, just "after", which was then replaced by your procedure.

Publication: https://pubmed.ncbi.nlm.nih.gov/22/
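For what it's worth, here is my current understanding of the reverse-substitution step, as a hedged sketch; the abbrev_map entry below is hypothetical, chosen only to reproduce the "after"/"T3" case I observed, and the real mapping is exactly what I'm asking about:

```python
# Hypothetical sketch of reverse substitution as I understand it:
# each occurrence of a known expansion is replaced by one of its
# abbreviations, recording the token index and the true expansion as the label.
abbrev_map = {"after": "T3"}  # hypothetical entry; real mapping not released
tokens = "improvement was seen after treatment".split()

locations, labels = [], []
for i, word in enumerate(tokens):
    if word in abbrev_map:
        labels.append(word)           # LABEL column: the true expansion
        locations.append(i)           # LOCATION column: token index
        tokens[i] = abbrev_map[word]  # substitute the abbreviation in TEXT
print(" ".join(tokens), locations, labels)
# improvement was seen T3 treatment [3] ['after']
```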

I also noticed that the text in the MeDAL dataset does not contain any punctuation. Why was this removed? The part of your publication that presents the creation of the dataset does not go into detail here.

You also mention modifying abbreviations from prior research, ending up with 5,886 abbreviations in total with corresponding expansions. Is this dataframe/dictionary also available for download?

This is the code I used to reconstruct such an abbreviation-expansion dictionary from the MeDAL dataset:

from tqdm.auto import tqdm
import pandas as pd

medal_df = pd.read_csv(path_to_full_data_csv)
medal_df.head()

loop_df = medal_df[:50000]
overall_article_abbreviations_df = pd.DataFrame()

for index, row in tqdm(loop_df.iterrows(), total=len(loop_df)):
    # Token positions of the abbreviations and their true expansions
    start_positions = [int(x) for x in row["LOCATION"].split("|")]
    resolutions = row["LABEL"].split("|")
    tokens = row["TEXT"].split(" ")
    abbreviations = [tokens[pos] for pos in start_positions]

    this_article_abbreviations_df = pd.DataFrame(
        {"Abbreviations": abbreviations, "Resolutions": resolutions}
    )
    overall_article_abbreviations_df = pd.concat(
        [overall_article_abbreviations_df, this_article_abbreviations_df]
    )

abbreviation_df = (
    overall_article_abbreviations_df
    .groupby(["Abbreviations", "Resolutions"])
    .size()
    .sort_values(ascending=False)
    .reset_index(name="count")
)
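On a made-up two-row frame mimicking the MeDAL columns, the counting logic above behaves as I'd expect (data invented for illustration):

```python
import pandas as pd

# Made-up frame with the same columns as the real CSV
toy = pd.DataFrame({
    "TEXT": ["pt had T3 surgery", "T3 the MRI scan"],
    "LOCATION": ["2", "0"],
    "LABEL": ["after", "after"],
})

rows = []
for _, row in toy.iterrows():
    positions = [int(x) for x in row["LOCATION"].split("|")]
    resolutions = row["LABEL"].split("|")
    tokens = row["TEXT"].split(" ")
    for pos, res in zip(positions, resolutions):
        rows.append({"Abbreviations": tokens[pos], "Resolutions": res})

counts = (
    pd.DataFrame(rows)
    .groupby(["Abbreviations", "Resolutions"])
    .size()
    .reset_index(name="count")
)
print(counts)  # single pair: Abbreviations="T3", Resolutions="after", count=2
```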

I inspected the most and least frequently occurring abbreviations, and I suspect that some error happened during the reverse-substitution process for most of the frequently occurring ones:

[screenshot]

Hopefully you can help me understand these things a little bit better :)

Doubts in Expansion of Abbreviation in Coding

Hello BruceWen!

I'm looking exclusively for abbreviation expansion in sentences, and I came across your GitHub repo. I have a few questions about your work.

  1. I started using the pre-trained models one by one, starting with LSTM (lstm = torch.hub.load("BruceWen120/medal", "lstm")).
    I want to disambiguate the abbreviations in the top 10 sentences of the test data downloaded from Kaggle, and later use my own data. When I converted the test data to (batch=10, sequence=230, dim=300) format using fastText with crawl-300d-2M-subword.bin and fed it to the model, I got a 22,555-dimensional output (as documented) for every data point. Can you tell me what this output refers to? What is the next step after getting these 22,555-dimensional outputs, and where and how do I get the expansions from them?

  2. This tool disambiguates only a single abbreviation in a sentence, not multiple, since it is location-based, so it is mandatory to have locations in the test data. Can you tell me how you detected and extracted the locations of abbreviations in a given text? I have to do this on my own data as well.

Looking forward to your reply.
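In case anyone lands here with the same question: my current assumption is that the 22,555-dimensional output is a score per candidate expansion, so the prediction is an argmax into a label vocabulary. The id_to_expansion list below is hypothetical and would have to be built from the dataset's LABEL vocabulary; a toy 3-class version stands in for the real 22,555 classes:

```python
# Hypothetical sketch: turning model logits into an expansion string.
# `id_to_expansion` would be built from the dataset's LABEL vocabulary
# (this is exactly the artifact I'm asking about above).
logits = [0.1, 2.5, -0.3]  # toy 3-class scores instead of 22,555
id_to_expansion = ["after", "thyroxine", "triiodothyronine"]

pred = max(range(len(logits)), key=lambda i: logits[i])  # argmax index
print(id_to_expansion[pred])  # thyroxine
```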
