ncbi-nlp / BioSentVec
BioWordVec & BioSentVec: pre-trained embeddings for biomedical words and sentences
License: Other
Hello,
I wanted to use context embeddings for words to perform word-similarity tasks. Is there a way to get the context vectors for words using the fastText model file?
Thanks,
Aditya
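Not an official answer, but a possible starting point (a sketch assuming the fasttext Python bindings and the full .bin model file): the skip-gram "context" (output) vectors can be read from the model's output matrix.

import fasttext

# Assumes the full BioWordVec .bin model, not the .vec word-vector file.
model = fasttext.load_model('BioWordVec_PubMed_MIMICIII_d200.bin')

# model.get_word_vector() reads from the input matrix; the context vectors
# used during training live in the output matrix, indexed by word id.
output_matrix = model.get_output_matrix()
word_id = model.get_word_id('kidney')  # -1 if the word is not in the vocabulary
if word_id >= 0:
    context_vector = output_matrix[word_id]
    print(context_vector.shape)  # (200,)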
I was trying to follow the provided BioSentVec_tutorial.ipynb, but after downloading BioSentVec and trying to load it, I got an AttributeError.
Hi,
Thanks for this very valuable resource. I would like to know the month/year in which the PubMed corpus used to train the BioWordVec models was downloaded. That is, articles in PubMed up to what day/month/year were used to build the BioWordVec models?
Thank you,
Mani
I used your pretrained word embeddings and sentence embeddings trained on PubMed+MIMIC-III, and I would like to conduct my experiments on the PubMed corpus only.
So, could you send me the pretrained word embeddings and sentence embeddings trained only on the PubMed corpus?
Thank you for your consideration.
I am looking forward to hearing from you.
Hi all, thanks for providing the embeddings for the BioSentVec model 21GB (700dim, trained on PubMed+MIMIC-III). Awesome work that helped me a lot!
I'd like to have the BioSentVec model trained only on the PubMed corpus. Did you train such a model too, or only the combined model with the corpora of PubMed+MIMIC-III?
I have tried the following model: BioSentVec model 21GB (700dim, trained on PubMed+MIMIC-III), but my test dataset contains data from MIMIC-III, which creates a conflict.
I would appreciate an answer from you.
I just want to use the pretrained Bio sent2vec model on my local machine, which has about 12 GB of RAM, while the model is about 22 GB. When I try to load the model using sent2vec, it hangs and returns nothing; the reason is probably that the model never finishes loading because there is not enough memory. Is there any way to load the model with under 12 GB of RAM?
When you load it like this, will these be the top 4E5 words found during training? I believe other vector bins like Google News work like this. Thank you!
import gensim

word2vec = gensim.models.KeyedVectors.load_word2vec_format(
    'data/BioWordVec_PubMed_MIMICIII_d200.vec.bin',
    binary=True,
    limit=int(4E5)  # faster load if you limit to the most frequent terms?
)
I also hope to learn the vocabulary sizes of the other two pretrained models. Thank you.
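For what it's worth, a quick way to check a loaded model's vocabulary size (a sketch; key_to_index is the gensim 4 attribute, older versions expose vocab instead):

import gensim

word2vec = gensim.models.KeyedVectors.load_word2vec_format(
    'data/BioWordVec_PubMed_MIMICIII_d200.vec.bin', binary=True)

# With limit=N in the call above, this would report N instead of the full size.
print(len(word2vec.key_to_index))  # gensim >= 4.0; use len(word2vec.vocab) in 3.x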
While exploring nearest neighbors, I have seen many words that seem to be invalid.
import fasttext
model = fasttext.load_model('./BioSentVec/models/BioWordVec_PubMed_MIMICIII_d200.bin')
model.get_nearest_neighbors('kidney', 30)
This gives the following output:
[(0.9160109162330627, u'kidney*'),
(0.9024562239646912, u'kidney=='),
(0.8989526033401489, u'kidneyks'),
(0.8850656747817993, u'kidney=5'),
(0.8817461133003235, u'kidney-kidney'),
(0.878646731376648, u'1kidney'),
(0.8774275183677673, u'2kidney'),
(0.87574702501297, u'kidney.6'),
(0.8753364682197571, u'qkidney'),
(0.8732652068138123, u'vkidney'),
(0.8732365369796753, u'kidney=48'),
(0.8726592659950256, u'kidneyx2'),
(0.8723018765449524, u'kidney.7'),
(0.8717607259750366, u'kidney2'),
(0.8697896003723145, u'kidneyys'),
(0.8693934679031372, u'kidneyl'),
(0.8692406415939331, u'kidneyds'),
(0.8683575987815857, u'lkidney'),
(0.8680075407028198, u'kidney*liver'),
(0.8666210174560547, u'kidney.5'),
(0.866155207157135, u'e1kidney'),
(0.8647593855857849, u'ckidney'),
(0.8646546006202698, u'ekidney'),
(0.8636531233787537, u'kidney~the'),
(0.861596941947937, u'kidney.2'),
(0.8599884510040283, u'dkidney'),
(0.8594629764556885, u'kidney.3'),
(0.8585801124572754, u'=kidney'),
(0.8581786155700684, u'vtkidney'),
(0.858029305934906, u'kidneywith')]
I wonder what these words are.
Are they coming from acupuncture points?
e.g. kidney2, kidney.2: do these represent http://www.acupuncture.com/education/points/kidney/kid2.htm ?
But when I use fastText's general English model (cc.en.300.bin), it returns the expected nearest words:
model = fasttext.load_model('./models/cc.en.300.bin')
model.get_nearest_neighbors('kidney', 30)
[(0.7705090045928955, u'renal'),
(0.7571945786476135, u'kidneys'),
(0.7136564254760742, u'Kidney'),
(0.6960737109184265, u'kindey'),
(0.6932832598686218, u'liver'),
(0.6215611100196838, u'gallbladder'),
(0.6096128225326538, u'kidney-'),
(0.592450737953186, u'kidney-related'),
(0.5883890390396118, u'lung'),
(0.5875317454338074, u'Renal'),
(0.5851610898971558, u'kidney.'),
(0.580848753452301, u'dialysis'),
(0.5669795870780945, u'Kidneys'),
(0.565768301486969, u'pre-renal'),
(0.5617753267288208, u'hydronephrotic'),
(0.5602078437805176, u'non-renal'),
(0.5586943030357361, u'extra-renal'),
(0.557516872882843, u'ureter'),
(0.5568706393241882, u'hydronephrosis'),
(0.5556935667991638, u'nephrosis'),
(0.5507169961929321, u'extrarenal'),
(0.5478389859199524, u'bladder'),
(0.5455406904220581, u'nephritis'),
(0.540409505367279, u'pancreas'),
(0.538938045501709, u'gall-bladder'),
(0.5365235805511475, u'TEENney'),
(0.5338416695594788, u'pancreatic'),
(0.5323835611343384, u'ureteric'),
(0.5321975946426392, u'glomerular'),
(0.5308919548988342, u'prerenal')]
Thank you for the great embeddings. I have a few questions on how you calculated the similarity scores for UMNSRS.
As I understand it, BioWordVec is a word embedding: each word is represented as a vector. However, some items in UMNSRS are phrases containing more than one word. Did you average the vectors of the words in each phrase and then calculate the cosine similarity?
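For reference, a minimal sketch of that averaging approach (assuming the released .vec.bin word vectors and gensim 4; not necessarily how the authors computed their reported numbers):

import numpy as np
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    'BioWordVec_PubMed_MIMICIII_d200.vec.bin', binary=True)

def phrase_vector(phrase):
    # Average the vectors of the phrase's in-vocabulary words.
    vectors = [model[w] for w in phrase.lower().split() if w in model]
    return np.mean(vectors, axis=0) if vectors else None

v1 = phrase_vector('renal failure')
v2 = phrase_vector('kidney failure')
if v1 is not None and v2 is not None:
    print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))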
One more question: how do you deal with words that are not in the vocabulary? For example, I found that:
(ana)
arthriits
buterfly
varicsoe
haletosis
are not in the vocabulary. Did you impute something, or just discard those terms?
One more question: I found that the window size in the .sh script is 30, while you describe using 20 for the extrinsic tasks. Which one yields a better result?
Thank you, and I am looking forward to your reply!
Will it work on Windows? Since
https://github.com/epfml/sent2vec
seems to work only on Linux.
Also, can some of the files be used without
https://github.com/epfml/sent2vec
i.e., using only fastText?
Hi
Interesting paper and approach; however, I am somewhat confused about how to reproduce the results on both datasets. More importantly, the paper mentions (as far as I understand) using a 5-layer deep neural network trained on the embeddings generated by BioSentVec. Isn't the dataset size too small for deep networks, and would it be possible to share the training code?
Hi everyone,
thank you for making this resource public; it is a great help to the community.
I am having issues loading the .bin (model) file with gensim. Code snippet follows:
from gensim.models.fasttext import load_facebook_vectors

gensim_fasttext_model = load_facebook_vectors(root_path + "models/pretrained/BioWordVec_PubMed_MIMICIII_d200.bin")
Although the above snippet uses CPU/RAM resources, it never finishes loading the model (it appears to load indefinitely), nor does it produce an error.
When I try to load it with the fastText library, it loads within roughly 90 seconds. Code snippet:
import fastText as fasttext
fasttext.load_model(root_path+'models/pretrained/BioWordVec_PubMed_MIMICIII_d200.bin')
Unfortunately, I would prefer to use the gensim approach, as it enables __getitem__ to generate representations (e.g. model['word_to_represent']).
I can load the vec.bin file (pretrained word-embedding mapping) with
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('BioWordVec_PubMed_MIMICIII_d200.vec.bin', binary=True)
but that does not help me in my current pipeline (as it does not deal with OOVs).
Could you provide any help or guidance on what might be wrong? My env is:
Linux-3.10.0-693.5.2.el7.x86_64-x86_64-with-centos-7.6.1810-Core
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]
NumPy 1.16.2
SciPy 1.1.0
gensim 3.7.2
FAST_VERSION 1
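Until the gensim load is sorted out, one possible workaround (a sketch using the fastText bindings, which produce vectors for OOV words from subword n-grams):

import fastText as fasttext

model = fasttext.load_model(root_path + 'models/pretrained/BioWordVec_PubMed_MIMICIII_d200.bin')

# get_word_vector composes subword n-grams, so even OOV words get a vector.
vec = model.get_word_vector('word_to_represent')
print(vec.shape)  # (200,)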
I have Python 3.8.5 and installed fasttext as well as sent2vec. But when I try to load the model, Python crashes with the single message "Model file cannot be opened for loading!". The full installation code is:
conda create -n sent2vec "python==3.8.5"
conda activate sent2vec
pip install Cython
git clone https://github.com/facebookresearch/fastText.git
git clone https://github.com/epfml/sent2vec.git
cd fastText
pip install .
cd ../sent2vec
pip install .
Then, python code is:
import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model("./BioSentVec_PubMed_MIMICIII-bigram_d700.bin")
What could be wrong?
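"Model file cannot be opened for loading!" often just means the path is wrong or the download is incomplete; a quick sanity check before debugging the build (a sketch; the exact expected size may vary):

import os

path = './BioSentVec_PubMed_MIMICIII-bigram_d700.bin'
print(os.path.exists(path))
# The published model is roughly 21 GB; a much smaller file
# suggests a truncated or failed download.
print(os.path.getsize(path) / 1e9, 'GB')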
The following is not a bug in your code; rather, I am wondering if anyone has thoughts on it.
I'm working on some NLP tasks in oncology.
I have found that randomly initialized word/sentence embeddings tend to work better than any pretrained embeddings for ultimately classifying, say, improving vs. worsening cancer.
I had an intuition that this might be because the key words for telling the two apart tend to be embedded similarly.
Trying BioSentVec, this seems to be borne out, e.g.:
from scipy.spatial import distance

# preprocess_sentence is the helper from BioSentVec_tutorial.ipynb
progression = model.embed_sentence(preprocess_sentence("Increase in size of tumor"))
response = model.embed_sentence(preprocess_sentence("Decrease in size of tumor"))
1 - distance.cosine(progression, response)
This yields 0.94. Opposite meanings are embedded similarly, which explains why building classifiers on these embeddings does not work well.
Are there any methods for addressing this in the transfer-learning setting that you're aware of? I have not found any.
Would it be possible for you to make your source corpora available, both raw and preprocessed (tokenized / sentence-split)? That would be very useful in helping folks create resources with other methods.
Is this a RAM, GPU memory, or hard disk space issue?
I am unable to import BioWordVec with KeyedVectors. I also tried with Word2Vec, but then it gives deprecation warnings. Please help as soon as possible.
I built a TensorFlow model on top of BioSentVec embeddings. Now that I am trying to deploy the model, I need BioSentVec at inference time to preprocess the inputs.
I am trying to deploy the model on AWS using Lambda and EFS.
I have mounted EFS on the Lambda and get the following error when I try to load the model:
terminate called after throwing an instance of 'std::bad_alloc'
Here is the Stack Overflow issue I have created: https://stackoverflow.com/questions/63817981/terminate-called-after-throwing-an-instance-of-stdbad-alloc-in-aws-efs-while?noredirect=1#comment112852352_63817981
Can someone guide me as to what is happening? Is this due to the 3 GB RAM limitation on Lambda?
If that is so, is the only option to deploy a model that uses BioSentVec to use an EC2 instance?
The MayoSRS and UMNSRS_similarity datasets mostly contain phrases. Did you use mean pooling to get the results you reported, or some other pooling mechanism for n-grams longer than 1?
Thanks
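Not an answer from the authors, but for anyone reproducing this: fastText's built-in mean pooling over (L2-normalized) word vectors is exposed as get_sentence_vector, which is one common way to embed a multi-word phrase:

import fasttext

model = fasttext.load_model('BioWordVec_PubMed_MIMICIII_d200.bin')

# get_sentence_vector averages the normalized word vectors of the input.
vec = model.get_sentence_vector('renal failure')
print(vec.shape)  # (200,)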
Hi,
thank you for sharing this with the rest of us! It's already coming in handy ;)
Just a minor issue (and I'm not sure it is one): did you by mistake attach the "opposite" files to the respective links?
BioWordVec vector 13GB (200dim, trained on PubMed+MIMIC-III, word2vec bin format) -> downloads the .bin file, size 27GB
BioWordVec model 26GB (200dim, trained on PubMed+MIMIC-III) -> downloads the .vec.bin file, size 13GB
Best,
J.
While both the paper and the README.md file mention a window size of 20, the train_biowordvec.sh script uses -ws 30.
What was the final window size used to produce the models?
It's really great that you have provided pretrained embeddings. For completeness, please also add the code written to generate them; it would serve as a useful technical example for someone to improve upon. Thanks.
I want to use the pretrained BioSentVec model to extract sentence vectors. I am following the code below and ran into the error "no module named 'sent2vec'". Do you know how to resolve this?
import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .")
I have done the following steps:
Hi,
This is a very good model for bio embeddings; however, I need to train it further on my own medical text dataset for internal prediction tasks. How can I do that?
Thanks
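One possible route (a sketch, not an endorsed recipe): load the released .bin model into gensim and continue unsupervised training on your own corpus. The calls below assume gensim 4; continued training of Facebook-format models has known caveats, so validate the resulting vectors.

from gensim.models.fasttext import load_facebook_model

# Replace with an iterator over your tokenized medical text.
sentences = [
    ['patient', 'presented', 'with', 'acute', 'renal', 'failure'],
    ['no', 'evidence', 'of', 'metastatic', 'disease'],
]

model = load_facebook_model('BioWordVec_PubMed_MIMICIII_d200.bin')
model.build_vocab(sentences, update=True)  # add your corpus vocabulary
model.train(sentences, total_examples=len(sentences), epochs=5)
model.save('biowordvec_finetuned.model')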
Where can I find the dataset?
When I calculated the similarity between "disease causing" and "not disease causing" using BioSentVec, it gave 1, but I think it should give something close to zero. Kindly have a look at the following:
from scipy.spatial import distance
sentence_vector1 = model.embed_sentence(preprocess_sentence("disease causing"))
sentence_vector2 = model.embed_sentence(preprocess_sentence("not disease causing"))
cosine_sim = 1 - distance.cosine(sentence_vector1, sentence_vector2)
print( cosine_sim ) # this will print 1
NameError                                 Traceback (most recent call last)
<ipython-input> in <module>
      3 for line in pos:
      4     print(line)
----> 5     sentence_vector = model.embed_sentence(line)
      6     pos_arrays[i] = sentence_vector
      7     pos_labels[i] = 1

src/sent2vec.pyx in sent2vec.Sent2vecModel.embed_sentence()
src/sent2vec.pyx in sent2vec.Sent2vecModel.embed_sentences()
src/sent2vec.pyx in sent2vec.vector_wrapper.asarray()

NameError: name 'stdvector_base' is not defined
I went into the code of stdvector_base and only saw "pass" in the class definition. I added a default constructor, but it doesn't help resolve the issue. Any suggestions? This April I could run it successfully, but it no longer works.
Would it be possible to find similar sentences using the sent2vec model?
How can I use the BioSentVec model to query for similar sentences?
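A sketch of one way to do this (assuming the sent2vec bindings and a small in-memory corpus preprocessed as in the tutorial): embed every candidate sentence, then rank by cosine similarity against the query.

import numpy as np
import sent2vec

model = sent2vec.Sent2vecModel()
model.load_model('BioSentVec_PubMed_MIMICIII-bigram_d700.bin')

corpus = [
    'the patient has chronic kidney disease .',
    'renal function remained stable .',
    'the weather was sunny today .',
]
corpus_emb = model.embed_sentences(corpus)                 # shape: (n_sentences, 700)
query_emb = model.embed_sentence('kidney disease progression .')[0]

# Cosine similarity between the query and every corpus sentence.
sims = corpus_emb @ query_emb / (
    np.linalg.norm(corpus_emb, axis=1) * np.linalg.norm(query_emb))
for idx in np.argsort(-sims):
    print(round(float(sims[idx]), 3), corpus[idx])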
Hi
Where can I find the MEDSTS dataset? Also, to evaluate the BioSentVec model on that dataset, do I need any pre-processing?
Thanks
Abhishek
https://github.com/ncbi-nlp/BioSentVec/wiki#how-to-use-the-biowordvec-and-biosentvec-model
The BioWordVec is built upon sent2vec.
I guess you meant to say BioSentVec.