
Comments (3)

bfabiandev avatar bfabiandev commented on June 18, 2024 1

Hey, thanks!

  1. Where we note that we are doing finetuning, we finetune the whole pre-trained encoder without freezing any layers.
  2. I think there is some confusion here: 0.323 is the performance of the best model and does not denote similarity, so I'm not sure I understand the question. Feel free to clarify! Thanks

from molbert.

LivC193 avatar LivC193 commented on June 18, 2024

Hi, thank you for your fast answer. Sorry for the confusion; I will try to explain what I mean.
As input for your model you used the dataset published here:
To generate the final dataset for the benchmarks, ChEMBL is post-processed by

  1. removal of salts.
  2. charge neutralization.
  3. removal of molecules with SMILES strings longer than 100 characters.
  4. removal of molecules containing any element other than H, B, C, N, O, F, Si, P, S, Cl, Se, Br, and I.
  5. removal of molecules with a larger ECFP4 similarity than 0.323 compared to a holdout set consisting of 10 marketed drugs (celecoxib, aripiprazole, cobimetinib, osimertinib, troglitazone, ranolazine, thiothixene, albuterol, fexofenadine, mestranol). This allows us to define similarity benchmarks for targets that are not part of the training set."

My question was referring to list item number 5 in the Data Set Generation. I assumed that for each molecule in your dataset you computed the similarity to those 10 drugs and discarded the molecule if its similarity was higher than 0.323. I was curious how you selected this cutoff and what type of similarity was used.
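To make the filtering rule I have in mind concrete, here is a plain-Python sketch that treats each ECFP4 fingerprint as a set of on-bit indices (real fingerprints would come from a cheminformatics toolkit such as RDKit; the bit sets, molecule names, and function names below are made up for illustration, not taken from the MolBERT or GuacaMol code):

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def filter_by_holdout(dataset_fps, holdout_fps, cutoff=0.323):
    """Keep molecules whose maximum similarity to any holdout drug is <= cutoff."""
    kept = []
    for name, fp in dataset_fps:
        max_sim = max(tanimoto(fp, h) for h in holdout_fps)
        if max_sim <= cutoff:
            kept.append(name)
    return kept

# Toy bit sets standing in for real ECFP4 fingerprints (hypothetical data).
holdout = [{1, 2, 3, 4}, {10, 11, 12}]
dataset = [("mol_a", {1, 2, 3, 5}),   # Tanimoto 0.6 to the first holdout drug -> dropped
           ("mol_b", {20, 21, 22})]   # similarity 0.0 to both -> kept
print(filter_by_holdout(dataset, holdout))  # -> ['mol_b']
```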

As a follow-up question: for your pre-trained model you set max_seq_length (SMILES length) to 128, but in some tests you set it to 512. If I want to use your pre-trained model (max_seq_length = 128) to embed SMILES longer than 128 characters, can I simply change the max_seq_length argument, or would that embedding be incorrect?


JoshuaMeyers avatar JoshuaMeyers commented on June 18, 2024

Hi @LivC193, thanks for your interest in our work. Regarding threshold selection for similarity filtering in the GuacaMol training dataset, I can point you to reference (86) in the GuacaMol paper, which is this blog post: http://rdkit.blogspot.com/2013/10/fingerprint-thresholds.html. I believe Tanimoto similarity was used (admittedly, the relevant figure from the blog post seems to have been transcribed as 0.323 instead of 0.321). This is in line with other suggested Tanimoto thresholds for ECFP4 fingerprints (e.g. here).

If you have any follow up questions regarding the training dataset, it might be worth asking in the GuacaMol repo (apologies for the slow response here).

Regarding max_seq_length: we use relative positional encodings as described in Transformer-XL, which allow MolBERT to process sequences of arbitrary length at inference time despite being trained with a fixed length. The caveat is that MolBERT has not been trained on longer SMILES examples, so we cannot guarantee that the model generalizes to longer SMILES; this would require further investigation. I would also be interested in your experience if you do try it out.
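The length-independence property can be illustrated without any model code: a relative encoding table is indexed by pairwise distances between tokens rather than absolute positions, so the lookup is well defined for any sequence length. A minimal sketch (the clipping behaviour and function name are illustrative of relative-position schemes generally, not MolBERT's actual implementation):

```python
def relative_positions(seq_len: int, max_distance: int = 128):
    """Pairwise relative distances j - i, clipped to [-max_distance, max_distance].

    Because the embedding table is indexed by distance rather than absolute
    position, the same learned parameters apply to sequences longer than those
    seen in training; distances beyond the clip value reuse the boundary entry.
    """
    return [[max(-max_distance, min(max_distance, j - i))
             for j in range(seq_len)]
            for i in range(seq_len)]

# A 4-token sequence with a trained distance range of 2: every index stays
# inside [-2, 2] no matter how long the sequence grows.
print(relative_positions(4, max_distance=2))
```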

