samoturk / mol2vec
Mol2vec - an unsupervised machine learning approach to learn vector representations of molecular substructures
License: BSD 3-Clause "New" or "Revised" License
I have a base64-encoded string of a ROMol that needs to be converted back to its original object form. I write the base64-encoded string to a text file, but after reading the file I need to convert the string back into an object so that I can generate embeddings from it with the function "mol2alt_sentence".
I am only able to save the ROMol as a base64-encoded string, not as a serialized object, so I need a way to convert the string back to the object. Please suggest a solution or an alternative approach.
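One possible approach, sketched under an assumption the post does not confirm: that the string was produced by base64-encoding the bytes returned by RDKit's mol.ToBinary(). In that case, decoding the text to bytes and passing them to the Chem.Mol constructor should rebuild the molecule. The helper names b64_to_bytes and b64_to_mol are illustrative, not part of mol2vec or RDKit.

```python
import base64


def b64_to_bytes(b64_text: str) -> bytes:
    """Decode base64 text read from a file back into raw bytes."""
    # strip() drops the trailing newline that text files usually add
    return base64.b64decode(b64_text.strip())


def b64_to_mol(b64_text: str):
    """Rebuild an RDKit molecule from base64 text.

    Assumes the text encodes the output of mol.ToBinary().
    """
    from rdkit import Chem  # imported lazily so the decoder works without RDKit

    return Chem.Mol(b64_to_bytes(b64_text))


# Round trip of the encoding step alone, using placeholder bytes
encoded = base64.b64encode(b"placeholder-bytes").decode("ascii") + "\n"
assert b64_to_bytes(encoded) == b"placeholder-bytes"
```

The object returned by b64_to_mol could then be passed to mol2alt_sentence like any other RDKit molecule.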
Hi, first, thanks for making this great OSS library, much appreciated.
I'm interested in taking the pretrained ZINC-based model and training it further with sentences from my own data set. My data set is small, only a few thousand SMILES, so I'm trying to take a page from the CNN book, where you take a pretrained image recognition model and specialize it further for your use case.
Is there any way to achieve that within your library?
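A minimal sketch of how continued training might look with gensim directly, since mol2vec models are gensim Word2Vec models. It assumes the pretrained file loads as a full Word2Vec model (not only its vectors), which the thread does not confirm. The function name continue_training and the default of 5 epochs are placeholders.

```python
def continue_training(model, new_sentences, epochs=5):
    """Extend the vocabulary of a gensim Word2Vec model and train it further.

    `model` is expected to be a full gensim Word2Vec model;
    `new_sentences` is a list of lists of identifier strings,
    for example the output of mol2alt_sentence for each molecule.
    """
    sentences = [list(s) for s in new_sentences]
    # update=True adds new words to the existing vocabulary instead of resetting it
    model.build_vocab(sentences, update=True)
    # total_examples and epochs must be given explicitly when calling train()
    model.train(sentences, total_examples=len(sentences), epochs=epochs)
    return model
```

Identifiers that occur less often than the model's min_count setting would still be left out of the vocabulary, so a small data set may add few new words.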
Thanks for providing the implementation. I have a question about running the code: for different input molecules, the mol2alt_sentence function outputs sentences of different lengths, so the final DfVec also contains multiple 300-dimensional feature vectors. How should I aggregate them?
I am just using the standard pipeline but with different input molecules:
sentence = mol2alt_sentence(mol, radius=1)
sentence_obj = MolSentence(sentence)
mol_vec = DfVec(sentences2vec(sentence_obj, model, unseen='UNK'))
Thanks!
Feng
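The sentences2vec docstring describes the aggregation as a sum of the word vectors of each sentence, so one sentence should give one vector regardless of its length. The sketch below shows that idea with a toy two-dimensional lookup table instead of a real model; the table, the identifiers, and the function names are made up for illustration. One plausible cause of the multiple vectors, stated here as an assumption, is that a single sentence is passed where a list of sentences is expected, so each word gets treated as its own sentence.

```python
def sum_vectors(vectors):
    """Element-wise sum of equally long vectors."""
    return [sum(column) for column in zip(*vectors)]


def sentence_to_vec(sentence, table, unseen="UNK"):
    """One vector per sentence: the sum of its word vectors."""
    unseen_vec = table[unseen]
    # Words missing from the table fall back to the `unseen` vector
    return sum_vectors([table.get(word, unseen_vec) for word in sentence])


def sentences_to_vecs(sentences, table, unseen="UNK"):
    """One vector per sentence in a list of sentences."""
    return [sentence_to_vec(s, table, unseen) for s in sentences]


toy_table = {"111": [1.0, 0.0], "222": [0.0, 2.0], "UNK": [0.5, 0.5]}
sentence = ["111", "222", "111", "999"]  # "999" is not in the table

# Note the list around `sentence`: the function iterates over sentences
vecs = sentences_to_vecs([sentence], toy_table)
assert vecs == [[2.5, 2.5]]
```

Applied to the snippet above, this would mean calling sentences2vec([sentence_obj], model, unseen='UNK') and taking the first row of the result.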
Hi!
We are very excited about using your project. We have the notebook samples working and we want to try them on our own RDKit molecules.
Unfortunately, we don't know how to convert RDKit molecules into the Morgan fingerprints that you are using as keys into the embedding dictionary. We can convert RDKit molecules to bit vectors, but can't seem to match RDKit molecules to your non-bit-vector representation (integers?).
Please advise!
Thanks!
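A rough sketch of how the integer identifiers can be obtained, based on my reading of what mol2alt_sentence does: RDKit's unhashed Morgan fingerprint fills a bitInfo dictionary that maps each integer identifier to the (atom index, radius) environments where it occurs, and those identifiers, converted to strings and ordered by atom and radius, form the sentence. The function names below are illustrative. AllChem.GetMorganFingerprint is an RDKit call that newer releases mark as deprecated, so treat its availability as an assumption about the installed version.

```python
def bitinfo_to_sentence(bit_info, n_atoms, radius):
    """Order Morgan identifiers by atom index, then by radius 0..radius."""
    radii = range(radius + 1)
    per_atom = {a: {r: None for r in radii} for a in range(n_atoms)}
    for identifier, environments in bit_info.items():
        for atom_idx, r in environments:
            per_atom[atom_idx][r] = identifier
    # Identifiers become strings because the embedding keys are strings
    return [str(per_atom[a][r]) for a in range(n_atoms) for r in radii
            if per_atom[a][r] is not None]


def mol_to_sentence(mol, radius=1):
    """Build the identifier sentence for an RDKit molecule."""
    from rdkit.Chem import AllChem  # imported lazily; requires RDKit

    info = {}
    AllChem.GetMorganFingerprint(mol, radius, bitInfo=info)
    return bitinfo_to_sentence(info, mol.GetNumAtoms(), radius)


# Hand-made bitInfo for a hypothetical two-atom molecule
toy_info = {111: ((0, 0), (1, 0)), 222: ((0, 1),), 333: ((1, 1),)}
assert bitinfo_to_sentence(toy_info, 2, 1) == ["111", "222", "111", "333"]
```

In practice, calling mol2alt_sentence(mol, radius) on an RDKit molecule should give the same kind of list without any extra code.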
The installation hangs after running pip install git+https://github.com/samoturk/mol2vec and does not proceed any further.
Hi Sam,
Is it possible to perform a similarity search with the model? Any ideas how it could be done?
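One common way to do this, sketched here as a suggestion rather than a feature of mol2vec: compute one vector per molecule (for example with sentences2vec) and rank library molecules by cosine similarity to the query vector. The toy vectors, molecule names, and function names below are placeholders.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_vec, library, top_n=3):
    """Return the top_n (name, similarity) pairs, most similar first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy 2-dimensional molecule vectors; real mol2vec vectors have 300 dimensions
library = {"mol_a": [1.0, 0.0], "mol_b": [0.0, 1.0], "mol_c": [1.0, 1.0]}
hits = most_similar([2.0, 0.1], library, top_n=2)
assert [name for name, _ in hits] == ["mol_a", "mol_c"]
```

For large libraries, the same ranking could be done with a vectorized matrix product instead of a Python loop.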
After applying mol2alt_sentence to get the molecular sentence, is there any way to convert this back to the Mol object?
E.g. I have the sentence ['1016841875', '198706261', '2245384272', '2909042096', '2245384272', '2909042096', '1016841875', '198706261'] - can I convert this back to an rdkit.Chem.rdchem.Mol object?
I have found the object mol2vec.helpers.IdentifierTable but I'm unsure what it's used for or whether it's helpful.
PS: the mol2vec project is a great implementation and very helpful for my research so far!
```python
import numpy as np


def sentences2vec(sentences, model, unseen=None):
    """Generate vectors for each sentence (list) in a list of sentences. Vector is simply a
    sum of vectors for individual words.

    Parameters
    ----------
    sentences : list, array
        List with sentences
    model : word2vec.Word2Vec
        Gensim word2vec model
    unseen : None, str
        Keyword for unseen words. If None, those words are skipped.
        https://stats.stackexchange.com/questions/163005/how-to-set-the-dictionary-for-text-analysis-using-neural-networks/163032#163032

    Returns
    -------
    np.array
    """
    # gensim 4.x keeps the vocabulary in model.wv.key_to_index
    keys = set(model.wv.key_to_index)
    vec = []
    if unseen:
        unseen_vec = model.wv.get_vector(unseen)

    for sentence in sentences:
        if unseen:
            # Words missing from the vocabulary are replaced by the `unseen` vector
            vec.append(sum([model.wv.get_vector(y) if y in keys
                            else unseen_vec for y in sentence]))
        else:
            # Words missing from the vocabulary are skipped
            vec.append(sum([model.wv.get_vector(y) for y in sentence
                            if y in keys]))
    return np.array(vec)
```
When running with Python 3.8, I get the following error message:
Featurizing molecules.
Traceback (most recent call last):
File "/Users/dkazempour/opt/anaconda3/bin/mol2vec", line 8, in
sys.exit(run())
File "/Users/dkazempour/opt/anaconda3/lib/python3.8/site-packages/mol2vec/app/mol2vec.py", line 165, in run
args.func(args)
File "/Users/dkazempour/opt/anaconda3/lib/python3.8/site-packages/mol2vec/app/mol2vec.py", line 25, in do_featurize
features.featurize(args.in_file, args.out_file, args.model, args.radius, args.uncommon)
File "/Users/dkazempour/opt/anaconda3/lib/python3.8/site-packages/mol2vec/features.py", line 465, in featurize
word2vec_model[uncommon]
TypeError: 'Word2Vec' object is not subscriptable
Which library is causing this issue?
Update: I realized that this is related to the other issue, titled "update sentences2vec function for gensim 4.0", by Maledive.
The library causing it is gensim: something changed in the 4.x.x versions that produces the error above.
A temporary 'fix' (really a quick-and-dirty hack) is to install an older version: pip install -Iv gensim==3.8.2
Afterwards I could successfully run mol2vec again.
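As an alternative to pinning gensim, the lookup that fails in the traceback could be written so that it works on both major versions. Indexing the model directly (model[word]) is no longer supported in gensim 4.x, while model.wv[word] works in 3.x and 4.x. The two helper names below are illustrative and are not part of mol2vec.

```python
def get_word_vector(model, word):
    """Look up a word vector in a way that works for gensim 3.x and 4.x."""
    # model[word] fails on gensim 4.x; model.wv[word] is available in both
    return model.wv[word]


def vocabulary_keys(model):
    """Return the set of known words for gensim 3.x and 4.x models."""
    wv = model.wv
    if hasattr(wv, "key_to_index"):
        # gensim 4.x
        return set(wv.key_to_index)
    # gensim 3.x
    return set(wv.vocab)
```

With these helpers, the failing line word2vec_model[uncommon] would become get_word_vector(word2vec_model, uncommon).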
Hi,
I've done some similarity search experiments using your project mol2vec. Here is a draft of my blog post. I'd appreciate it if you could read it and give me some feedback. Thank you!
draft link: https://medium.com/@sabrina.sj.ho/tanimoto-vs-mol2vec-7fa4af3208ef
Hi, first, thanks for making this great OSS library, much appreciated.
The article indicates that only a specific set of elements is allowed to appear in the SMILES of a molecule. Are lowercase atom symbols, such as c, o, h, n, ..., also included?
I can't download ZINC15. Could you provide a way to download it?
Hi @samoturk,
Are the pre-trained models available, or will they be made available in the future?
best,
Miha
Hi, thanks for putting together the notebook to explore mol2vec. However, I am getting the following error when loading the model using model = KeyedVectors.load('model_300dim.pkl'):
AttributeError: 'Word2Vec' object has no attribute 'vocabulary'
I am using gensim v3.3.0.
Also, I noticed there are two versions of the 'model_300dim.pkl' file, one around 25 MB and another around 74 MB. Which one should be used? I have tried both versions and see the same error. Thanks for any help!