ncbi-nlp / BioSentVec
BioWordVec & BioSentVec: pre-trained embeddings for biomedical words and sentences
License: Other
Hello,
I wanted to use context embeddings for words to perform word-similarity tasks. Is there a way to get the context vectors for words using the fastText model file?
Thanks,
Aditya
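Not an official answer, but a possible starting point (a sketch assuming the fasttext Python bindings and the full .bin model file): the skip-gram "context" (output) vectors can be read from the model's output matrix.

import fasttext

# Assumes the full BioWordVec .bin model, not the .vec word-vector file.
model = fasttext.load_model('BioWordVec_PubMed_MIMICIII_d200.bin')

# model.get_word_vector() reads from the input matrix; the context vectors
# used during training live in the output matrix, indexed by word id.
output_matrix = model.get_output_matrix()
word_id = model.get_word_id('kidney')  # -1 if the word is not in the vocabulary
if word_id >= 0:
    context_vector = output_matrix[word_id]
    print(context_vector.shape)  # (200,)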
I was trying to follow the provided BioSentVec_tutorial.ipynb, but after downloading BioSentVec and trying to load it, I got an AttributeError.
Hi,
Thanks for this very valuable resource. I would like to know the month/year in which the PubMed corpus used to train the BioWordVec models was downloaded. That is, articles in PubMed up to what day/month/year were used to build the BioWordVec models?
Thank you,
Mani
I used your pretrained word embeddings and sentence embeddings trained on PubMed+MIMIC-III, and I would like to conduct my experiments on the PubMed corpus only.
So, could you send me the pretrained word embeddings and sentence embeddings trained only on the PubMed corpus?
Thank you for your consideration.
I am looking forward to hearing from you.
Hi all, thanks for providing the embeddings for the BioSentVec model 21GB (700dim, trained on PubMed+MIMIC-III). Awesome work that helped me a lot!
I'd like to have the BioSentVec model trained only on the PubMed corpus. Did you train such a model too, or only the combined model with the corpora of PubMed+MIMIC-III?
I have tried the following model: BioSentVec model 21GB (700dim, trained on PubMed+MIMIC-III), but my test dataset contains data from MIMIC-III, which creates a conflict.
I would appreciate an answer from you.
I just want to use the pretrained Bio sent2vec model on my local machine, which has about 12 GB of RAM, while the model is about 22 GB. When I try to load the model using sent2vec, it hangs and returns nothing; the reason is probably that the model never finishes loading because there is not enough memory. Is there any way to load the model with under 12 GB of RAM?
When you load it like this, will these be the top 4E5 words found during training? I believe other vector bins like Google News work like this. Thank you!
import gensim

word2vec = gensim.models.KeyedVectors.load_word2vec_format(
    'data/BioWordVec_PubMed_MIMICIII_d200.vec.bin',
    binary=True,
    limit=int(4E5)  # faster load if you limit to the most frequent terms?
)
I also hope to learn the vocabulary sizes of the other two pretrained models. Thank you.
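For what it's worth, a quick way to check a loaded model's vocabulary size (a sketch; key_to_index is the gensim 4 attribute, older versions expose vocab instead):

import gensim

word2vec = gensim.models.KeyedVectors.load_word2vec_format(
    'data/BioWordVec_PubMed_MIMICIII_d200.vec.bin', binary=True)

# With limit=N in the call above, this would report N instead of the full size.
print(len(word2vec.key_to_index))  # gensim >= 4.0; use len(word2vec.vocab) in 3.x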
While exploring nearest neighbors, I have seen many words that seem to be invalid.
import fasttext
model = fasttext.load_model('./BioSentVec/models/BioWordVec_PubMed_MIMICIII_d200.bin')
model.get_nearest_neighbors('kidney', 30)
This gives the following output:
[(0.9160109162330627, u'kidney*'),
(0.9024562239646912, u'kidney=='),
(0.8989526033401489, u'kidneyks'),
(0.8850656747817993, u'kidney=5'),
(0.8817461133003235, u'kidney-kidney'),
(0.878646731376648, u'1kidney'),
(0.8774275183677673, u'2kidney'),
(0.87574702501297, u'kidney.6'),
(0.8753364682197571, u'qkidney'),
(0.8732652068138123, u'vkidney'),
(0.8732365369796753, u'kidney=48'),
(0.8726592659950256, u'kidneyx2'),
(0.8723018765449524, u'kidney.7'),
(0.8717607259750366, u'kidney2'),
(0.8697896003723145, u'kidneyys'),
(0.8693934679031372, u'kidneyl'),
(0.8692406415939331, u'kidneyds'),
(0.8683575987815857, u'lkidney'),
(0.8680075407028198, u'kidney*liver'),
(0.8666210174560547, u'kidney.5'),
(0.866155207157135, u'e1kidney'),
(0.8647593855857849, u'ckidney'),
(0.8646546006202698, u'ekidney'),
(0.8636531233787537, u'kidney~the'),
(0.861596941947937, u'kidney.2'),
(0.8599884510040283, u'dkidney'),
(0.8594629764556885, u'kidney.3'),
(0.8585801124572754, u'=kidney'),
(0.8581786155700684, u'vtkidney'),
(0.858029305934906, u'kidneywith')]
I wonder what these words are.
Are they coming from acupuncture points?
e.g. kidney2, kidney.2: do these represent http://www.acupuncture.com/education/points/kidney/kid2.htm ?
But when I use fastText's general English model (cc.en.300.bin), it returns the expected nearest words:
model = fasttext.load_model('./models/cc.en.300.bin')
model.get_nearest_neighbors('kidney', 30)
[(0.7705090045928955, u'renal'),
(0.7571945786476135, u'kidneys'),
(0.7136564254760742, u'Kidney'),
(0.6960737109184265, u'kindey'),
(0.6932832598686218, u'liver'),
(0.6215611100196838, u'gallbladder'),
(0.6096128225326538, u'kidney-'),
(0.592450737953186, u'kidney-related'),
(0.5883890390396118, u'lung'),
(0.5875317454338074, u'Renal'),
(0.5851610898971558, u'kidney.'),
(0.580848753452301, u'dialysis'),
(0.5669795870780945, u'Kidneys'),
(0.565768301486969, u'pre-renal'),
(0.5617753267288208, u'hydronephrotic'),
(0.5602078437805176, u'non-renal'),
(0.5586943030357361, u'extra-renal'),
(0.557516872882843, u'ureter'),
(0.5568706393241882, u'hydronephrosis'),
(0.5556935667991638, u'nephrosis'),
(0.5507169961929321, u'extrarenal'),
(0.5478389859199524, u'bladder'),
(0.5455406904220581, u'nephritis'),
(0.540409505367279, u'pancreas'),
(0.538938045501709, u'gall-bladder'),
(0.5365235805511475, u'TEENney'),
(0.5338416695594788, u'pancreatic'),
(0.5323835611343384, u'ureteric'),
(0.5321975946426392, u'glomerular'),
(0.5308919548988342, u'prerenal')]
Thank you for the great embeddings. I have a few questions on how you calculated the similarity scores for UMNSRS.
As I understand it, BioWordVec is a word embedding: each word is represented as a vector. However, some items in UMNSRS are phrases containing more than one word. Did you average the vectors of the words in each phrase and then calculate the cosine similarity?
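For reference, a minimal sketch of that averaging approach (assuming the released .vec.bin word vectors and gensim 4; not necessarily how the authors computed their reported numbers):

import numpy as np
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    'BioWordVec_PubMed_MIMICIII_d200.vec.bin', binary=True)

def phrase_vector(phrase):
    # Average the vectors of the phrase's in-vocabulary words.
    vectors = [model[w] for w in phrase.lower().split() if w in model]
    return np.mean(vectors, axis=0) if vectors else None

v1 = phrase_vector('renal failure')
v2 = phrase_vector('kidney failure')
if v1 is not None and v2 is not None:
    print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))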
One more question: how do you deal with words that are not in the vocabulary? For example, I found that:
(ana)
arthriits
buterfly
varicsoe
haletosis
are not in the vocabulary. Did you impute something, or just discard those terms?
One more question: I found that the window size in the .sh script is 30, while you describe using 20 for the extrinsic tasks. Which one yields a better result?
Thank you, and I am looking forward to your reply!
Will it work on Windows? Since
https://github.com/epfml/sent2vec
seems to work only on Linux.
Also, can some of the files be used without
https://github.com/epfml/sent2vec
i.e., using only fastText?
Hi
Interesting paper and approach; however, I am somewhat confused about how to reproduce the results on both datasets. More importantly, the paper mentions (as far as I understand) using a 5-layer deep neural network trained on the embeddings generated by BioSentVec. Isn't the dataset size too small for deep networks, and would it be possible to share the training code?
Hi everyone,
thank you for making this resource public; it is a great help to the community.
I am having issues loading the .bin (model) file with gensim. Code snippet follows:
from gensim.models.fasttext import load_facebook_vectors

gensim_fasttext_model = load_facebook_vectors(root_path + "models/pretrained/BioWordVec_PubMed_MIMICIII_d200.bin")
Although the above snippet uses CPU/RAM resources, it never finishes loading the model (it appears to load indefinitely), nor does it produce an error.
When I try to load it with the fastText library, it loads within roughly 90 seconds. Code snippet:
import fastText as fasttext
fasttext.load_model(root_path+'models/pretrained/BioWordVec_PubMed_MIMICIII_d200.bin')
Unfortunately, I would prefer to use the gensim approach, as it enables __getitem__ to generate representations (e.g. model['word_to_represent']).
I can load the vec.bin file (pretrained word-embedding mapping) with
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('BioWordVec_PubMed_MIMICIII_d200.vec.bin', binary=True)
but that does not help me in my current pipeline (as it does not deal with OOVs).
Could you provide any help or guidance on what might be wrong? My env is:
Linux-3.10.0-693.5.2.el7.x86_64-x86_64-with-centos-7.6.1810-Core
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]
NumPy 1.16.2
SciPy 1.1.0
gensim 3.7.2
FAST_VERSION 1
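Until the gensim load is sorted out, one possible workaround (a sketch using the fastText bindings, which produce vectors for OOV words from subword n-grams):

import fastText as fasttext

model = fasttext.load_model(root_path + 'models/pretrained/BioWordVec_PubMed_MIMICIII_d200.bin')

# get_word_vector composes subword n-grams, so even OOV words get a vector.
vec = model.get_word_vector('word_to_represent')
print(vec.shape)  # (200,)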
I have Python 3.8.5 and installed fasttext as well as sent2vec. But when I try to load the model, Python crashes with the single message "Model file cannot be opened for loading!". The full installation code is:
conda create -n sent2vec "python==3.8.5"
conda activate sent2vec
pip install Cython
git clone https://github.com/facebookresearch/fastText.git
git clone https://github.com/epfml/sent2vec.git
cd fastText
pip install .
cd ../sent2vec
pip install .
Then, python code is:
import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model("./BioSentVec_PubMed_MIMICIII-bigram_d700.bin")
What could be wrong?
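"Model file cannot be opened for loading!" often just means the path is wrong or the download is incomplete; a quick sanity check before debugging the build (a sketch; the exact expected size may vary):

import os

path = './BioSentVec_PubMed_MIMICIII-bigram_d700.bin'
print(os.path.exists(path))
# The published model is roughly 21 GB; a much smaller file
# suggests a truncated or failed download.
print(os.path.getsize(path) / 1e9, 'GB')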
The following is not a bug in your code; rather, I am wondering if anyone has thoughts on it.
I'm working on some NLP tasks in oncology.
I have found that randomly initialized word/sentence embeddings tend to work better than any pretrained embeddings for ultimately classifying, say, improving vs. worsening cancer.
I had an intuition that this might be because the key words for telling the two apart tend to be embedded similarly.
Trying BioSentVec, this seems to be borne out, e.g.:
from scipy.spatial import distance

# preprocess_sentence is the helper from BioSentVec_tutorial.ipynb
progression = model.embed_sentence(preprocess_sentence("Increase in size of tumor"))
response = model.embed_sentence(preprocess_sentence("Decrease in size of tumor"))
1 - distance.cosine(progression, response)
This yields 0.94. Opposite meanings are embedded similarly, which explains why building classifiers on these embeddings does not work well.
Are there any methods for addressing this in the transfer-learning setting that you're aware of? I have not found any.
Would it be possible for you to make your source corpora available, both raw and preprocessed (tokenized / sentence-split)? That would be very useful in helping folks create resources with other methods.
Is this a RAM, GPU memory, or hard disk space issue?
I am unable to import BioWordVec with KeyedVectors. I also tried with Word2Vec, but then it gives deprecation warnings. Please help as soon as possible.
I built a TensorFlow model on top of BioSentVec embeddings. Now that I am trying to deploy the model, I need BioSentVec at inference time to preprocess the inputs.
I am trying to deploy the model on AWS using Lambda and EFS.
I have mounted EFS on the Lambda and get the following error when I try to load the model:
terminate called after throwing an instance of 'std::bad_alloc'
Here is the Stack Overflow issue I have created: https://stackoverflow.com/questions/63817981/terminate-called-after-throwing-an-instance-of-stdbad-alloc-in-aws-efs-while?noredirect=1#comment112852352_63817981
Can someone guide me as to what is happening? Is this due to the 3 GB RAM limitation on Lambda?
If that is so, is the only option to deploy a model that uses BioSentVec to use an EC2 instance?
The MayoSRS and UMNSRS_similarity datasets mostly contain phrases. Did you use mean pooling to get the results you reported, or some other pooling mechanism for n-grams longer than 1?
Thanks
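Not an answer from the authors, but for anyone reproducing this: fastText's built-in mean pooling over (L2-normalized) word vectors is exposed as get_sentence_vector, which is one common way to embed a multi-word phrase:

import fasttext

model = fasttext.load_model('BioWordVec_PubMed_MIMICIII_d200.bin')

# get_sentence_vector averages the normalized word vectors of the input.
vec = model.get_sentence_vector('renal failure')
print(vec.shape)  # (200,)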
Hi,
thank you for sharing this with the rest of us! It's already coming in handy ;)
Just a minor issue (and I'm not sure it is one): did you by mistake attach the "opposite" files to the respective links?
BioWordVec vector 13GB (200dim, trained on PubMed+MIMIC-III, word2vec bin format) -> downloads the .bin file, size 27GB
BioWordVec model 26GB (200dim, trained on PubMed+MIMIC-III) -> downloads the .vec.bin file, size 13GB
Best,
J.
While both the paper and the README.md file mention a window size of 20, the train_biowordvec.sh script uses -ws 30.
What was the final window size used to produce the models?
It's really great that you have provided pretrained embeddings. For completeness, please also add the code written to generate them; it would serve as a useful technical example for someone to improve upon. Thanks.
I want to use the pretrained BioSentVec model to extract sentence vectors. I am following the code below and ran into the error "no module named 'sent2vec'". Do you know how to resolve this?
import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .")
I have done the following steps:
Hi,
This is a very good model for bio embeddings; however, I need to train it further on my own medical text dataset for internal prediction tasks. How can I do that?
Thanks
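One possible route (a sketch, not an endorsed recipe): load the released .bin model into gensim and continue unsupervised training on your own corpus. The calls below assume gensim 4; continued training of Facebook-format models has known caveats, so validate the resulting vectors.

from gensim.models.fasttext import load_facebook_model

# Replace with an iterator over your tokenized medical text.
sentences = [
    ['patient', 'presented', 'with', 'acute', 'renal', 'failure'],
    ['no', 'evidence', 'of', 'metastatic', 'disease'],
]

model = load_facebook_model('BioWordVec_PubMed_MIMICIII_d200.bin')
model.build_vocab(sentences, update=True)  # add your corpus vocabulary
model.train(sentences, total_examples=len(sentences), epochs=5)
model.save('biowordvec_finetuned.model')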
Where can I find the dataset?
When I calculated the similarity between "disease causing" and "not disease causing" using BioSentVec, it gave 1, but I think it should give something close to zero. Kindly have a look at the following:
from scipy.spatial import distance
sentence_vector1 = model.embed_sentence(preprocess_sentence("disease causing"))
sentence_vector2 = model.embed_sentence(preprocess_sentence("not disease causing"))
cosine_sim = 1 - distance.cosine(sentence_vector1, sentence_vector2)
print( cosine_sim ) # this will print 1
NameError                                 Traceback (most recent call last)
<ipython-input> in <module>
      3 for line in pos:
      4     print(line)
----> 5     sentence_vector = model.embed_sentence(line)
      6     pos_arrays[i] = sentence_vector
      7     pos_labels[i] = 1

src/sent2vec.pyx in sent2vec.Sent2vecModel.embed_sentence()
src/sent2vec.pyx in sent2vec.Sent2vecModel.embed_sentences()
src/sent2vec.pyx in sent2vec.vector_wrapper.asarray()

NameError: name 'stdvector_base' is not defined
I went into the code of stdvector_base and only saw "pass" in the class definition. I added a default constructor, but it doesn't help resolve the issue. Any suggestions? This April I could run it successfully, but it no longer works.
Would it be possible to find similar sentences using the sent2vec model?
How can I use the BioSentVec model to query for similar sentences?
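A sketch of one way to do this (assuming the sent2vec bindings and a small in-memory corpus preprocessed as in the tutorial): embed every candidate sentence, then rank by cosine similarity against the query.

import numpy as np
import sent2vec

model = sent2vec.Sent2vecModel()
model.load_model('BioSentVec_PubMed_MIMICIII-bigram_d700.bin')

corpus = [
    'the patient has chronic kidney disease .',
    'renal function remained stable .',
    'the weather was sunny today .',
]
corpus_emb = model.embed_sentences(corpus)                 # shape: (n_sentences, 700)
query_emb = model.embed_sentence('kidney disease progression .')[0]

# Cosine similarity between the query and every corpus sentence.
sims = corpus_emb @ query_emb / (
    np.linalg.norm(corpus_emb, axis=1) * np.linalg.norm(query_emb))
for idx in np.argsort(-sims):
    print(round(float(sims[idx]), 3), corpus[idx])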
Hi
Where can I find the MEDSTS dataset? Also, to evaluate the BioSentVec model on that dataset, do I need any pre-processing?
Thanks
Abhishek
https://github.com/ncbi-nlp/BioSentVec/wiki#how-to-use-the-biowordvec-and-biosentvec-model
The BioWordVec is built upon sent2vec.
I guess you meant to say BioSentVec.