
biosentvec's People

Contributors

kaushikacharya, qingyu-chen, qingyu-qc, yfpeng


biosentvec's Issues

Context Vectors for words

Hello,

I wanted to use context embeddings for words to perform word similarity tasks. Is there a way to get the context vectors for words using the fastText model file?

Thanks,
Aditya
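
Not an official answer, but a sketch of one way this might work with the fasttext pip package, assuming its get_word_vector / get_word_id / get_output_matrix interface (in skip-gram training the output matrix plays the role of the "context" vectors; treat the availability of these calls as an assumption to check against your installed version):

import fasttext
import numpy as np

model = fasttext.load_model('BioWordVec_PubMed_MIMICIII_d200.bin')

# Input (word) vector -- what most similarity tasks use.
w = model.get_word_vector('kidney')

# Output ("context") vector: row i of the output matrix is the
# context representation of vocabulary word i.
ctx = model.get_output_matrix()[model.get_word_id('kidney')]

print(np.dot(w, ctx))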

Months/Year of the PubMed corpus?

Hi,
Thanks for this very valuable resource. Could you share the month/year in which the PubMed corpus used to train the BioWordVec models was downloaded? That is, articles in PubMed up to what day/month/year were used to build the models?

Thank you,
Mani

PubMed corpus only trained model for BioSentVec

Hi all, thanks for providing the embeddings for the BioSentVec model 21GB (700dim, trained on PubMed+MIMIC-III). Awesome work that has helped me a lot!

I'd like to have a BioSentVec model trained only on the PubMed corpus. Did you train such a model too, or only the combined model on the PubMed+MIMIC-III corpora?

I have tried the BioSentVec model 21GB (700dim, trained on PubMed+MIMIC-III), but my test dataset contains data from MIMIC-III, which creates a train/test overlap.

I would appreciate an answer.

How can I load the BioSentVec sent2vec model with limited RAM?

I want to use the pretrained BioSentVec sent2vec model on my local machine, which has about 12 GB of RAM, but the model file is about 22 GB. When I try to load it with sent2vec, the call hangs and returns nothing, presumably because the model never finishes loading and there is not enough memory. Is there any way to load the model with under 12 GB of RAM?
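
Not a fix, but a quick way to confirm the diagnosis before loading (a sketch assuming the psutil package is installed; sent2vec, like fastText, loads the entire .bin into RAM, so a file larger than available memory will thrash or be killed rather than fail cleanly):

import os
import psutil  # assumption: pip install psutil

model_path = "BioSentVec_PubMed_MIMICIII-bigram_d700.bin"
model_size = os.path.getsize(model_path)
free_ram = psutil.virtual_memory().available
print(f"model: {model_size / 1e9:.1f} GB, available RAM: {free_ram / 1e9:.1f} GB")
# If model_size exceeds free_ram, load_model() cannot succeed on this machine.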

Question: when you specify limit, does it start with most frequently found words?

When you load it like this, will these be the top 4E5 words found during training? I believe other vector bins like Google News work like this. Thank you!

import gensim

word2vec = gensim.models.KeyedVectors.load_word2vec_format(
    'data/BioWordVec_PubMed_MIMICIII_d200.vec.bin',
    binary=True,
    limit=int(4E5)  # faster load if you limit to most frequent terms?
)
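
One hedged way to check: gensim's limit keeps only the first N entries in the file, so if the file follows the usual word2vec convention of descending-frequency order, the first keys should be very common tokens. A quick inspection (the attribute name depends on the gensim version; this is a sketch, not a confirmed property of this particular .vec.bin):

# gensim >= 4.0 exposes index_to_key; gensim 3.x used index2word.
keys = getattr(word2vec, 'index_to_key', None) or word2vec.index2word
print(keys[:20])  # expect very frequent tokens if the file is frequency-sorted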

Invalid words in vocabulary?

While exploring nearest neighbors, I have seen many entries that appear to be invalid words.

import fasttext
model = fasttext.load_model('./BioSentVec/models/BioWordVec_PubMed_MIMICIII_d200.bin')
model.get_nearest_neighbors('kidney', 30)

This gives the following output:

[(0.9160109162330627, u'kidney*'),
(0.9024562239646912, u'kidney=='),
(0.8989526033401489, u'kidneyks'),
(0.8850656747817993, u'kidney=5'),
(0.8817461133003235, u'kidney-kidney'),
(0.878646731376648, u'1kidney'),
(0.8774275183677673, u'2kidney'),
(0.87574702501297, u'kidney.6'),
(0.8753364682197571, u'qkidney'),
(0.8732652068138123, u'vkidney'),
(0.8732365369796753, u'kidney=48'),
(0.8726592659950256, u'kidneyx2'),
(0.8723018765449524, u'kidney.7'),
(0.8717607259750366, u'kidney2'),
(0.8697896003723145, u'kidneyys'),
(0.8693934679031372, u'kidneyl'),
(0.8692406415939331, u'kidneyds'),
(0.8683575987815857, u'lkidney'),
(0.8680075407028198, u'kidney*liver'),
(0.8666210174560547, u'kidney.5'),
(0.866155207157135, u'e1kidney'),
(0.8647593855857849, u'ckidney'),
(0.8646546006202698, u'ekidney'),
(0.8636531233787537, u'kidney~the'),
(0.861596941947937, u'kidney.2'),
(0.8599884510040283, u'dkidney'),
(0.8594629764556885, u'kidney.3'),
(0.8585801124572754, u'=kidney'),
(0.8581786155700684, u'vtkidney'),
(0.858029305934906, u'kidneywith')]

I am wondering what these words are.
Are they coming from acupuncture points?
e.g. kidney2, kidney.2 - do these represent http://www.acupuncture.com/education/points/kidney/kid2.htm ?

  • Even if that's the case, is it correct to generate words from the phrase kidney 2 ?
  • Or does this mean that pre-processing wasn't done properly?

But when I use fastText's own pretrained model, it returns the expected nearest words:

model = fasttext.load_model('./models/cc.en.300.bin')
model.get_nearest_neighbors('kidney', 30)
[(0.7705090045928955, u'renal'),
(0.7571945786476135, u'kidneys'),
(0.7136564254760742, u'Kidney'),
(0.6960737109184265, u'kindey'),
(0.6932832598686218, u'liver'),
(0.6215611100196838, u'gallbladder'),
(0.6096128225326538, u'kidney-'),
(0.592450737953186, u'kidney-related'),
(0.5883890390396118, u'lung'),
(0.5875317454338074, u'Renal'),
(0.5851610898971558, u'kidney.'),
(0.580848753452301, u'dialysis'),
(0.5669795870780945, u'Kidneys'),
(0.565768301486969, u'pre-renal'),
(0.5617753267288208, u'hydronephrotic'),
(0.5602078437805176, u'non-renal'),
(0.5586943030357361, u'extra-renal'),
(0.557516872882843, u'ureter'),
(0.5568706393241882, u'hydronephrosis'),
(0.5556935667991638, u'nephrosis'),
(0.5507169961929321, u'extrarenal'),
(0.5478389859199524, u'bladder'),
(0.5455406904220581, u'nephritis'),
(0.540409505367279, u'pancreas'),
(0.538938045501709, u'gall-bladder'),
(0.5365235805511475, u'TEENney'),
(0.5338416695594788, u'pancreatic'),
(0.5323835611343384, u'ureteric'),
(0.5321975946426392, u'glomerular'),
(0.5308919548988342, u'prerenal')]
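
A sketch that may explain why the noisy tokens still land near kidney (assuming the fasttext pip package, whose get_subwords call returns the character n-grams a token is decomposed into): fastText builds every word vector from character n-grams, so a corpus token like kidney2 that survived pre-processing shares almost all of its n-grams with kidney and is therefore embedded very close to it, regardless of whether it is a meaningful word.

# Inspect the character n-grams of 'kidney2'; most of them
# overlap with the n-grams of 'kidney' itself.
subwords, ids = model.get_subwords('kidney2')
print(subwords)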

Question on calculating the similarity of UMNSRS

Thank you for the great embedding. I have a few questions on how to calculate the similarity of UMNSRS.

Per my understanding, BioWordVec is a word embedding: each word is represented as a vector. However, some terms in UMNSRS are phrases that contain more than one word. Did you average the vectors of the words in each phrase and then calculate the cosine similarity?
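
For concreteness, a minimal sketch of what such mean pooling plus cosine similarity could look like (an assumption about the method, not a confirmed description of the paper's procedure; gensim and the .vec.bin file are used as elsewhere in these issues):

import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    'BioWordVec_PubMed_MIMICIII_d200.vec.bin', binary=True)

def phrase_vector(phrase):
    # Mean-pool the vectors of the phrase's in-vocabulary words.
    vecs = [wv[w] for w in phrase.lower().split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(phrase_vector('renal failure'), phrase_vector('kidney failure')))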

One more question: how do you deal with words that are not in the vocabulary? For example, I found that the following terms:

(ana)
arthriits
buterfly
varicsoe
haletosis

are not in the vocabulary. Did you impute something or just discard those terms?

One more question: I found that the window size in the .sh script is 30, but you describe using 20 for the extrinsic task. Which one yields a better result?

Thank you, and looking forward to your reply!

BIOSSES and MEDSTS results

Hi
Interesting paper and approach; however, I am somewhat confused about how to reproduce the results on both datasets. More importantly, the paper mentions (as far as I understand) using a 5-layer deep neural network trained on the embeddings generated by BioSentVec. Isn't the dataset size too small for deep networks, and is it possible to share the training code?

gensim - read bin file?

Hi everyone,

thank you for making this resource public; it is a great help to the community.

I am having issues loading the .bin (model) file with gensim. Code snippet follows:

from gensim.models.fasttext import load_facebook_vectors
gensim_fasttext_model = load_facebook_vectors(
    root_path + "models/pretrained/BioWordVec_PubMed_MIMICIII_d200.bin")

Although the above snippet consumes CPU and RAM, it never finishes loading the model (it seems to load indefinitely), nor does it produce an error.

When I load it with the fastText library instead, it loads within about 90 seconds. Code snippet:

import fastText as fasttext
fasttext.load_model(root_path+'models/pretrained/BioWordVec_PubMed_MIMICIII_d200.bin')

Unfortunately, I would prefer to use the gensim approach, as it supports __getitem__ lookups to generate representations (e.g. model['word_to_represent']).

I can load the vec.bin file (pretrained word-embedding mapping) with
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('BioWordVec_PubMed_MIMICIII_d200.vec.bin', binary=True)

but that does not help me in my current pipeline (as it does not handle OOVs).

Could you provide any help/guidance on what might be wrong? My env is:

Linux-3.10.0-693.5.2.el7.x86_64-x86_64-with-centos-7.6.1810-Core
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]
NumPy 1.16.2
SciPy 1.1.0
gensim 3.7.2
FAST_VERSION 1
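
A hedged workaround sketch: on older gensim releases the fastText loader was FastText.load_fasttext_format (later deprecated in favor of load_facebook_model / load_facebook_vectors); it returns a full model whose wv supports both __getitem__ and subword-based OOV lookup. Whether it loads this 13 GB .bin any faster is an assumption to verify, not a promise:

from gensim.models import FastText

model = FastText.load_fasttext_format(
    root_path + "models/pretrained/BioWordVec_PubMed_MIMICIII_d200.bin")
print(model.wv['kidney'][:5])      # in-vocabulary lookup
print(model.wv['kidneyzzz'][:5])   # OOV, composed from subword n-grams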

"Model file cannot be opened for loading!" for BioSentVec

I have Python 3.8.5 and installed fasttext as well as sent2vec. But when I try to load the model, Python crashes with the single message "Model file cannot be opened for loading!". The full installation code is:

conda create -n sent2vec "python==3.8.5"
conda activate sent2vec
pip install Cython
git clone https://github.com/facebookresearch/fastText.git
git clone https://github.com/epfml/sent2vec.git
cd fastText
pip install .
cd ../sent2vec
pip install .

Then, the Python code is:

import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model("./BioSentVec_PubMed_MIMICIII-bigram_d700.bin")

What could be wrong?
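
Not a confirmed diagnosis, but this error is what fastText-based loaders raise when the file cannot be opened at all, so it is worth ruling out a wrong working directory or an incomplete download first (a minimal check; the rough 21 GB figure is the advertised size of this model):

import os

path = "./BioSentVec_PubMed_MIMICIII-bigram_d700.bin"
print(os.path.abspath(path), os.path.exists(path))
if os.path.exists(path):
    # The full model is roughly 21 GB; a much smaller file
    # suggests a truncated download.
    print(f"{os.path.getsize(path) / 1e9:.1f} GB")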

Medical antonyms

The following is not a bug in your code; rather, I am wondering whether anyone has thoughts on it.
I'm working on some NLP tasks in oncology.
I have found that randomly initialized word/sentence embeddings tend to work better than any pretrained embeddings for ultimately classifying, say, improving vs. worsening cancer.
My intuition is that this might be because the key words for telling the two apart tend to be embedded similarly.

In trying BioSentVec, this seems to be borne out, e.g. (assuming model and preprocess_sentence are already defined):

from scipy.spatial import distance

progression = model.embed_sentence(preprocess_sentence("Increase in size of tumor"))
response = model.embed_sentence(preprocess_sentence("Decrease in size of tumor"))
1 - distance.cosine(progression, response)

Yields: 0.94. Opposite meanings are embedded similarly, which explains why building classifiers on these embeddings does not work well.

Are there any methods for addressing this in the transfer learning setting that you're aware of? I have not found any.

Make source corpora available

Would it be possible for you to make your source corpora available, both raw and preprocessed (tokenized / sentence-split)? It would be very useful in helping folks create resources with other methods.

terminate called after throwing an instance of 'std::bad_alloc' in AWS EFS while loading BioSentVec

I built a TensorFlow model on top of BioSentVec embeddings. Now that I am trying to deploy the model, I need BioSentVec at inference time to preprocess the inputs.

I am trying to deploy the model on AWS using Lambda and EFS.

I have mounted EFS on the Lambda and I get the following error when I try to load the model:

terminate called after throwing an instance of 'std::bad_alloc'

Here is the Stackoverflow issue I have created - https://stackoverflow.com/questions/63817981/terminate-called-after-throwing-an-instance-of-stdbad-alloc-in-aws-efs-while?noredirect=1#comment112852352_63817981

Can someone guide me as to what is happening? Is this due to the 3 GB RAM limitation on Lambda?

If that is so, is the only option for deploying a model that uses BioSentVec to use an EC2 instance?

BioWordVec - how to handle phrases

The MayoSRS and UMNSRS_similarity datasets mostly contain phrases. Did you use mean pooling to get the results you reported, or some other pooling mechanism for n-grams longer than 1?

Thanks

Links to vec/model file wrong?

Hi,

thank you for sharing this with the rest of us! It's already coming in handy ;)

Just a minor issue (and I'm not sure if it is one): did you by mistake link the "opposite" files to the respective entries?

BioWordVec vector 13GB (200dim, trained on PubMed+MIMIC-III, word2vec bin format) -> downloads the .bin file, size 27 GB

BioWordVec model 26GB (200dim, trained on PubMed+MIMIC-III) -> downloads the .vec.bin file, size 13 GB.

Best,
J.

Add generation code

It's really great that you have provided pretrained embeddings. For completeness, please also add the code used to generate them. It will serve as a useful technical example for someone to improve upon. Thanks.

How to import sent2vec

I want to use the pretrained BioSentVec to extract sentence vectors. I followed the code below and ran into the error "no module named 'sent2vec'". Do you know how to resolve this?

import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .")

I have done the following steps:

MedSTS

Where can I find the dataset?

Unable to handle negation of sentences

When I calculated the similarity between "disease causing" and "not disease causing" using BioSentVec, it gave 1, but I think it should be close to zero. Kindly have a look at the following:

from scipy.spatial import distance

sentence_vector1 = model.embed_sentence(preprocess_sentence("disease causing"))
sentence_vector2 = model.embed_sentence(preprocess_sentence("not disease causing"))
cosine_sim = 1 - distance.cosine(sentence_vector1, sentence_vector2)
print(cosine_sim)  # this will print 1

name 'stdvector_base' is not defined when calling sent2vec.Sent2vecModel.embed_sentence()

NameError                                 Traceback (most recent call last)
<ipython-input> in <module>
      3 for line in pos:
      4     print(line)
----> 5     sentence_vector = model.embed_sentence(line)
      6     pos_arrays[i] = sentence_vector
      7     pos_labels[i] = 1

src/sent2vec.pyx in sent2vec.Sent2vecModel.embed_sentence()

src/sent2vec.pyx in sent2vec.Sent2vecModel.embed_sentences()

src/sent2vec.pyx in sent2vec.vector_wrapper.asarray()

NameError: name 'stdvector_base' is not defined

I went to the code of stdvector_base and only see "pass" in the class definition. I added a default constructor, but that doesn't resolve the issue. Any suggestions? This April I could run it successfully, but it does not work now.

find similar sentences

Would it be possible to find similar sentences using the sent2vec model?
How can the BioSentVec model be used to query for similar sentences?
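
A minimal sketch of one way to do this (an illustration under assumptions, not an official recipe; model and preprocess_sentence are assumed to be set up as in the tutorial): embed every candidate sentence once, then rank candidates by cosine similarity to the embedded query.

import numpy as np

candidates = [
    "the patient was diagnosed with chronic kidney disease .",
    "renal function declined over the past year .",
    "the weather was sunny and warm .",
]

# Embed all candidates once; embed_sentences returns one row per sentence.
cand_vecs = model.embed_sentences([preprocess_sentence(s) for s in candidates])
cand_vecs = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)

query = model.embed_sentence(preprocess_sentence("kidney failure in the patient ."))
query = query.ravel() / np.linalg.norm(query)

# Cosine similarity = dot product of unit vectors; highest first.
scores = cand_vecs @ query
for i in np.argsort(-scores):
    print(f"{scores[i]:.3f}  {candidates[i]}")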

MedSTS dataset

Hi

Where can I find the MedSTS dataset? Also, do I need any pre-processing to evaluate the dataset with the BioSentVec model?

Thanks
Abhishek
