Hey, thanks!
- Where we note that we are doing finetuning, we finetune the whole pre-trained encoder without freezing any layers.
- I think there is some confusion here: 0.323 is the performance of the best model and does not denote a similarity, so I'm not sure I understand the question. Feel free to clarify! Thanks
from molbert.
Hi, thank you for your fast answer. Sorry for the confusion; I will try to explain what I mean.
As input for your model you used the dataset published here:
"To generate the final dataset for the benchmarks, ChEMBL is post-processed by:
- removal of salts.
- charge neutralization.
- removal of molecules with SMILES strings longer than 100 characters.
- removal of molecules containing any element other than H, B, C, N, O, F, Si, P, S, Cl, Se, Br, and I.
- removal of molecules with a larger ECFP4 similarity than 0.323 compared to a holdout set consisting of 10 marketed drugs (celecoxib, aripiprazole, cobimetinib, osimertinib, troglitazone, ranolazine, thiothixene, albuterol, fexofenadine, mestranol). This allows us to define similarity benchmarks for targets that are not part of the training set."
My question was referring to list item number 5 in the Data Set Generation. I assumed that for each molecule in your dataset you computed the similarity to those 10 drugs, and if the similarity was higher than 0.323 that molecule was discarded. I was curious how you selected this cutoff and what type of similarity was used.
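To make sure I have understood that step correctly, here is a sketch of the filtering logic as I read it. This is only my interpretation, not the actual GuacaMol code: in practice the fingerprints would be ECFP4 bit vectors from RDKit, but here plain integer sets stand in for the on-bits of a fingerprint so the example is self-contained.

```python
# Sketch of the similarity filter (step 5), assuming Tanimoto similarity
# over ECFP4-style fingerprints. Integer sets stand in for fingerprint
# on-bits; real code would use RDKit Morgan fingerprints (radius 2).

def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity = |A ∩ B| / |A ∪ B| over fingerprint on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def passes_filter(candidate: set, holdout_fps: list, threshold: float = 0.323) -> bool:
    """Keep a molecule only if it stays below the threshold against EVERY holdout drug."""
    return all(tanimoto(candidate, h) < threshold for h in holdout_fps)

# Toy example: two "holdout drug" fingerprints and two candidate molecules.
holdout = [{1, 2, 3, 4}, {10, 11, 12}]
near_duplicate = {1, 2, 3, 5}   # Tanimoto 3/5 = 0.6 to the first holdout -> removed
dissimilar = {20, 21, 22}       # similarity 0.0 to both -> kept

print(passes_filter(near_duplicate, holdout))  # False
print(passes_filter(dissimilar, holdout))      # True
```

Is that the right reading of the removal criterion?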
As a follow-up question: for your pre-trained model you set max_seq_length (SMILES length) to 128, but in some tests you set it to 512. If I want to use your pre-trained model (max_seq_length = 128) to embed SMILES longer than 128 characters, can I simply change the max_seq_length argument, or would that embedding be incorrect?
Hi @LivC182, thanks for your interest in our work. Regarding threshold selection for similarity filtering in the GuacaMol training dataset, I can point you to reference (86) in the GuacaMol paper, which is this blog post: http://rdkit.blogspot.com/2013/10/fingerprint-thresholds.html. I believe Tanimoto similarity was used (admittedly, the relevant figure from the blog seems to have been transcribed as 0.323 instead of 0.321). This is in line with other suggested Tanimoto thresholds for ECFP4 fingerprints (e.g. here).
If you have any follow up questions regarding the training dataset, it might be worth asking in the GuacaMol repo (apologies for the slow response here).
Regarding the max_seq_length: we use relative positional encodings as described in Transformer-XL, which allows MolBERT to process sequences of arbitrary length at inference time despite training with a fixed-length input. The caveat is that MolBERT has not been trained on longer SMILES examples, so we cannot guarantee that the model generalizes to longer SMILES; this would require further investigation. I would also be interested in your experience if you do try it out.
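To illustrate why relative positional encodings sidestep the fixed training length, here is a minimal sketch. It is not MolBERT's or Transformer-XL's actual implementation (the real scheme also involves learned content and position bias terms); it only shows that attention can be parameterized by the offset between tokens rather than their absolute index, so the parameters learned at length 128 are reusable at longer lengths.

```python
# Illustrative sketch: with relative positions, attention indexes positional
# parameters by the (clipped) offset i - j, not by absolute position. The
# offset matrix for a longer sequence contains the shorter one as its
# top-left block, so no new positional parameters are needed at inference.

def relative_offsets(seq_len: int, max_offset: int = 127):
    """Matrix of clipped relative offsets used to index positional parameters."""
    return [
        [max(-max_offset, min(max_offset, i - j)) for j in range(seq_len)]
        for i in range(seq_len)
    ]

short = relative_offsets(4)
longer = relative_offsets(6)

# The top-left 4x4 block of the longer matrix equals the shorter matrix:
assert all(longer[i][:4] == short[i] for i in range(4))
```

The remaining risk is exactly the one noted above: the model may never have seen the chemistry of very long SMILES during training, even though the positional machinery itself extrapolates.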
from molbert.