samoturk / mol2vec

Mol2vec - an unsupervised machine learning approach to learn vector representations of molecular substructures

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

cheminformatics machine-learning python

mol2vec's Issues

Convert back Base64encoded string of ROMol into an object of <rdkit.Chem.rdchem.Mol>

I have a base64-encoded string of an ROMol that needs to be converted back to its original object form. I use the base64-encoded string to write the data out as text, but after reading the text back I need to convert it into the object again so that I can generate embeddings from it with the function "mol2alt_sentence".

I can only save the base64-encoded ROMol as a string, not as a serialized object, so I need a way to convert the string back into the object. Please suggest a solution or an alternative approach.
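
One possible approach, sketched under the assumption that the string was produced by base64-encoding the output of RDKit's `Mol.ToBinary()`: decode the string to bytes and pass them to the `Chem.Mol` constructor, which accepts that binary form. The helper name `mol_from_base64` and its `from_binary` parameter are hypothetical and only there to keep the sketch easy to try without RDKit.

```python
import base64


def mol_from_base64(encoded, from_binary=None):
    """Rebuild a molecule from a base64 string of RDKit's binary format.

    `from_binary` is a hypothetical hook for the callable that turns raw
    bytes into a molecule; it defaults to RDKit's `Chem.Mol` constructor.
    """
    if from_binary is None:
        from rdkit import Chem  # imported lazily; requires RDKit
        from_binary = Chem.Mol
    raw = base64.b64decode(encoded)
    return from_binary(raw)


# Round trip, assuming the string was created like this:
#   encoded = base64.b64encode(mol.ToBinary()).decode("ascii")
#   mol = mol_from_base64(encoded)
```

If the string was produced some other way (for example by a database cartridge or a pandas export), the decoding step may differ, so it is worth checking how the string was written first.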

rdkit with tensorflow

Transfer learning

Hi, first, thanks for making this great OSS library, much appreciated.

I'm interested in taking the pretrained ZINC-based model and training it further with sentences from my own data set. The reason for this approach is that my data set is small, only a few thousand SMILES. So I'm trying to take a page from the CNN book, where you take a pretrained image recognition model and specialize it further for your use case.

Is there any way to achieve that within your library?
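
A minimal sketch of how this might be done with gensim directly, rather than through a dedicated mol2vec function: gensim's `Word2Vec.build_vocab(..., update=True)` extends an existing vocabulary and `Word2Vec.train(...)` continues training. The helper name `continue_training` is hypothetical, and the sketch assumes the pretrained file loads as a full `Word2Vec` model rather than as bare `KeyedVectors`.

```python
def continue_training(model, new_sentences, epochs=5):
    """Update a trained Word2Vec model with additional sentences.

    `model` is assumed to be a full gensim Word2Vec model, and
    `new_sentences` a list of token lists (one list per molecule).
    """
    # Add identifiers that the pretrained vocabulary has not seen yet.
    model.build_vocab(new_sentences, update=True)
    # Continue training; gensim requires total_examples and epochs here.
    model.train(new_sentences, total_examples=len(new_sentences), epochs=epochs)
    return model


# Possible usage (file name and variables are assumptions):
#   from gensim.models import word2vec
#   model = word2vec.Word2Vec.load("model_300dim.pkl")
#   sentences = [mol2alt_sentence(mol, 1) for mol in mols]
#   continue_training(model, sentences)
```

Note that identifiers occurring fewer times than the model's `min_count` are still dropped during the vocabulary update, which can matter for a data set of only a few thousand molecules.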

Feature size

Thanks for providing the implementation. I have a question about running the code: for different input molecules, the mol2alt_sentence function outputs sentences of different lengths, so the final DfVec also contains multiple 300-dimensional feature vectors. How should I aggregate them?

I am just using the standard pipeline but with different input molecules:
sentence = mol2alt_sentence(mol, radius=1)
sentence_obj = MolSentence(sentence)
mol_vec = DfVec(sentences2vec(sentence_obj, model, unseen='UNK'))

Thanks!

Feng
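
A likely explanation, judging from the `sentences2vec` code posted further down this page: the function expects a list of sentences and returns one summed vector per sentence. Passing a single sentence makes it treat every identifier as its own sentence, which gives one vector per identifier. The toy sketch below reproduces that behaviour in plain Python; `sum_sentence_vectors` and `toy_vectors` are hypothetical stand-ins for `sentences2vec` and the trained model.

```python
def sum_sentence_vectors(sentences, vectors, unseen=None):
    """Return one summed vector per sentence (a toy sentences2vec)."""
    dim = len(next(iter(vectors.values())))
    result = []
    for sentence in sentences:
        total = [0.0] * dim
        for word in sentence:
            if word in vectors:
                vec = vectors[word]
            elif unseen is not None:
                vec = vectors[unseen]  # fall back to the unseen token
            else:
                continue  # skip unknown words
            total = [t + v for t, v in zip(total, vec)]
        result.append(total)
    return result


toy_vectors = {"111": [1.0, 0.0], "222": [0.0, 2.0], "UNK": [0.5, 0.5]}
sentence = ["111", "222", "999"]

# Wrapped in a list: one vector for the whole molecule.
one_vector = sum_sentence_vectors([sentence], toy_vectors, unseen="UNK")
# Passed bare: each identifier string is treated as a separate sentence.
per_word = sum_sentence_vectors(sentence, toy_vectors, unseen="UNK")
```

Under that reading, calling `sentences2vec([sentence_obj], model, unseen='UNK')` and taking the first row should give a single 300-dimensional vector per molecule, with the sum over identifiers acting as the aggregation.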

How Do You Convert RDKIT molecule to your fingerprint key ?

Hi !

We are very excited about using your project. We have the notebook samples working and we want to try on our own RDKIT molecules.

Unfortunately, we don't know how to convert RDKit molecules into the Morgan fingerprint identifiers that you use as keys into the embedding dictionary. We can convert RDKit molecules to bit vectors, but we can't seem to map them to your non-bit-vector representation (integers?).

Please advise !

Thanks !
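
The simplest route is probably to call `mol2alt_sentence(mol, radius)` from `mol2vec.features`, as used elsewhere on this page; it returns the identifiers as strings. For readers who want to see the idea, here is a hedged sketch of the ordering step as I understand it: RDKit's Morgan fingerprint call can fill a `bitInfo` dictionary mapping each identifier to `(atom index, radius)` pairs, and the identifiers are then listed atom by atom and radius by radius. The function name `sentence_from_bit_info` and the toy data are hypothetical.

```python
def sentence_from_bit_info(bit_info, n_atoms, radius):
    """Order Morgan identifiers atom by atom, radius by radius."""
    per_atom = {
        atom: {r: None for r in range(radius + 1)} for atom in range(n_atoms)
    }
    # bit_info maps identifier -> iterable of (atom index, radius) pairs.
    for identifier, environments in bit_info.items():
        for atom_idx, env_radius in environments:
            per_atom[atom_idx][env_radius] = identifier
    sentence = []
    for atom in range(n_atoms):
        for r in range(radius + 1):
            identifier = per_atom[atom][r]
            if identifier is not None:
                sentence.append(str(identifier))  # keys are strings
    return sentence


# Toy data standing in for RDKit output. With RDKit it might come from:
#   info = {}
#   AllChem.GetMorganFingerprint(mol, radius, bitInfo=info)
#   sentence_from_bit_info(info, mol.GetNumAtoms(), radius)
toy_info = {10: ((0, 0), (2, 0)), 20: ((1, 0),), 30: ((1, 1),)}
toy_sentence = sentence_from_bit_info(toy_info, n_atoms=3, radius=1)
```

The resulting strings are what would be looked up in the embedding model, so a bit vector of fixed length is not needed at any point.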

Can't Clone it

It stops after running pip install git+https://github.com/samoturk/mol2vec and does not move forward

searching similar molecules

Hi Sam,

Is it possible to perform a similarity search with the model? Any ideas how it could be done?
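
One common way to do this with any embedding, offered as a sketch rather than as a feature of the library: compute a vector per molecule, then rank the library by cosine similarity to the query vector. The function names and the two-dimensional toy vectors below are hypothetical; real mol2vec vectors would have 300 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, library, top_n=3):
    """Return the top_n (name, score) pairs, best match first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


library = {"mol_a": [1.0, 0.0], "mol_b": [0.0, 1.0], "mol_c": [1.0, 1.0]}
hits = most_similar([2.0, 0.1], library, top_n=2)
```

For large libraries a vectorized or indexed nearest-neighbour search would be more practical than this linear scan.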

Convert sentences to mol

After applying mol2alt_sentence to get the molecular sentence, is there any way to convert this back to the Mol object?

E.g. I have the sentence ['1016841875', '198706261', '2245384272', '2909042096', '2245384272', '2909042096', '1016841875', '198706261'] - can I convert this back to an rdkit.Chem.rdchem.Mol object?

I have found the object mol2vec.helpers.IdentifierTable, but I'm unsure what it's used for or whether it's helpful here.

PS: the mol2vec project is a great implementation and very helpful for my research so far!

update sentences2vec function for gensim 4.0

```python
import numpy as np


def sentences2vec(sentences, model, unseen=None):
    """Generate vectors for each sentence (list) in a list of sentences. Vector is simply a
    sum of vectors for individual words.

    Parameters
    ----------
    sentences : list, array
        List with sentences
    model : word2vec.Word2Vec
        Gensim word2vec model
    unseen : None, str
        Keyword for unseen words. If None, those words are skipped.
        https://stats.stackexchange.com/questions/163005/how-to-set-the-dictionary-for-text-analysis-using-neural-networks/163032#163032

    Returns
    -------
    np.array
    """
    # gensim 4.x exposes the vocabulary as model.wv.key_to_index
    keys = set(model.wv.key_to_index)
    vec = []

    if unseen:
        unseen_vec = model.wv.get_vector(unseen)

    for sentence in sentences:
        if unseen:
            # Words missing from the vocabulary fall back to the unseen vector
            vec.append(sum([model.wv.get_vector(y) if y in keys
                            else unseen_vec for y in sentence]))
        else:
            # Words missing from the vocabulary are skipped
            vec.append(sum([model.wv.get_vector(y) for y in sentence
                            if y in keys]))
    return np.array(vec)
```

Type Error when running with Python 3.8

When running with Python 3.8, I get the following error message:

Featurizing molecules.
Traceback (most recent call last):
  File "/Users/dkazempour/opt/anaconda3/bin/mol2vec", line 8, in
    sys.exit(run())
  File "/Users/dkazempour/opt/anaconda3/lib/python3.8/site-packages/mol2vec/app/mol2vec.py", line 165, in run
    args.func(args)
  File "/Users/dkazempour/opt/anaconda3/lib/python3.8/site-packages/mol2vec/app/mol2vec.py", line 25, in do_featurize
    features.featurize(args.in_file, args.out_file, args.model, args.radius, args.uncommon)
  File "/Users/dkazempour/opt/anaconda3/lib/python3.8/site-packages/mol2vec/features.py", line 465, in featurize
    word2vec_model[uncommon]
TypeError: 'Word2Vec' object is not subscriptable

Which library is causing this issue?

Update: I realized that my observation is related to the other issue titled "update sentences2vec function for gensim 4.0" by Maledive.

The library causing this is gensim. Something changed in the 4.x releases that produces the error above.

A temporary 'fix' (really a quick-and-dirty hack) is to pin an older release: pip install -Iv gensim==3.8.2
Afterwards I could successfully run mol2vec again.
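
As an alternative to pinning gensim, here is a hedged sketch of version-tolerant access. In gensim 4.x the model object itself is no longer subscriptable, but `model.wv[word]` works in both 3.x and 4.x; the vocabulary lives in `wv.key_to_index` in 4.x and in `wv.vocab` in 3.x. The helper names `vocabulary_keys` and `word_vector` are hypothetical and not part of mol2vec.

```python
def vocabulary_keys(model):
    """Return the set of known words for gensim 3.x or 4.x models."""
    wv = model.wv
    if hasattr(wv, "key_to_index"):  # gensim 4.x
        return set(wv.key_to_index)
    return set(wv.vocab)  # gensim 3.x


def word_vector(model, word):
    """Look up a vector through model.wv, which both versions support."""
    return model.wv[word]
```

Applied to the traceback above, the failing `word2vec_model[uncommon]` lookup would presumably become `word2vec_model.wv[uncommon]`.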

blog: experiment using mol2vec

Hi,

I've done some similarity search experiments using your project mol2vec. Here is a draft of my blog. I'd appreciate if you could please read it and give me some feedback. Thank you!

draft link: https://medium.com/@sabrina.sj.ho/tanimoto-vs-mol2vec-7fa4af3208ef

filter criteria

Hi, first, thanks for making this great OSS library, much appreciated.

The article states that only certain elements are allowed to appear in a SMILES molecule. Are lowercase (aromatic) atom symbols such as c, o, h, n also included?

Also, I can't download ZINC15. Could you provide a way to download it?

pretrained models

hi @samoturk ,

Are the pre-trained models available / will be made available in the future?

best,
Miha

Error when loading model

Hi, thanks for putting together the notebook to explore mol2vec. However, I get the following error when loading the model with model = KeyedVectors.load('model_300dim.pkl'):

AttributeError: 'Word2Vec' object has no attribute 'vocabulary'

I am using gensim v 3.3.0.

Also, I noticed there are two versions of the 'model_300dim.pkl' file, one around 25 MB and another around 74 MB. Which one should be used? I have tried both and see the same error. Thanks for any help!