Hey, thanks!
- Where we note that we are doing finetuning, we finetune the whole pre-trained encoder without freezing any layers.
- I think there is some confusion here: 0.323 is the performance of the best model and does not denote a similarity, so I'm not sure I understand the question. Feel free to clarify! Thanks
from molbert.
Hi, thank you for your fast answer. Sorry for the confusion; I will try to explain what I mean.
As input for your model you used the dataset published here:
"To generate the final dataset for the benchmarks, ChEMBL is post-processed by:
- removal of salts.
- charge neutralization.
- removal of molecules with SMILES strings longer than 100 characters.
- removal of molecules containing any element other than H, B, C, N, O, F, Si, P, S, Cl, Se, Br, and I.
- removal of molecules with a larger ECFP4 similarity than 0.323 compared to a holdout set consisting of 10 marketed drugs (celecoxib, aripiprazole, cobimetinib, osimertinib, troglitazone, ranolazine, thiothixene, albuterol, fexofenadine, mestranol). This allows us to define similarity benchmarks for targets that are not part of the training set."
My question was referring to list item number 5 in the Data Set Generation. I assumed that for each molecule in your dataset you computed the similarity to those 10 drugs, and if the similarity was higher than 0.323 that molecule was discarded. I was curious how you selected this cutoff and what type of similarity was used.
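To make sure I have understood that step correctly, here is a sketch of the filtering logic as I read it. This is only my interpretation, not the actual GuacaMol code: in practice the fingerprints would be ECFP4 bit vectors from RDKit, but here plain integer sets stand in for the on-bits of a fingerprint so the example is self-contained.

```python
# Sketch of the similarity filter (step 5), assuming Tanimoto similarity
# over ECFP4-style fingerprints. Integer sets stand in for fingerprint
# on-bits; real code would use RDKit Morgan fingerprints (radius 2).

def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity = |A ∩ B| / |A ∪ B| over fingerprint on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def passes_filter(candidate: set, holdout_fps: list, threshold: float = 0.323) -> bool:
    """Keep a molecule only if it stays below the threshold against EVERY holdout drug."""
    return all(tanimoto(candidate, h) < threshold for h in holdout_fps)

# Toy example: two "holdout drug" fingerprints and two candidate molecules.
holdout = [{1, 2, 3, 4}, {10, 11, 12}]
near_duplicate = {1, 2, 3, 5}   # Tanimoto 3/5 = 0.6 to the first holdout -> removed
dissimilar = {20, 21, 22}       # similarity 0.0 to both -> kept

print(passes_filter(near_duplicate, holdout))  # False
print(passes_filter(dissimilar, holdout))      # True
```

Is that the right reading of the removal criterion?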
As a follow-up question: for your pre-trained model you set max_seq_length (SMILES length) to 128, but in some tests you set it to 512. If I want to use your pre-trained model (max_seq_length = 128) to embed SMILES longer than 128 characters, can I simply change the max_seq_length argument, or would that embedding be incorrect?
Hi @LivC182, thanks for your interest in our work. Regarding threshold selection for similarity filtering in the GuacaMol training dataset, I can point you to reference (86) in the GuacaMol paper, which is this blog post: http://rdkit.blogspot.com/2013/10/fingerprint-thresholds.html. I believe Tanimoto similarity was used (admittedly, the relevant figure from the blog seems to have been transcribed as 0.323 instead of 0.321). This is in line with other suggested Tanimoto thresholds for ECFP4 fingerprints (e.g. here).
If you have any follow up questions regarding the training dataset, it might be worth asking in the GuacaMol repo (apologies for the slow response here).
Regarding the max_seq_length: we use relative positional encodings as described in Transformer-XL, which allows MolBERT to process sequences of arbitrary length at inference time despite training with a fixed-length input. The caveat is that MolBERT has not been trained on longer SMILES examples, so we cannot guarantee that the model generalizes to longer SMILES; this would require further investigation. I would also be interested in your experience if you do try it out.
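To illustrate why relative positional encodings sidestep the fixed training length, here is a minimal sketch. It is not MolBERT's or Transformer-XL's actual implementation (the real scheme also involves learned content and position bias terms); it only shows that attention can be parameterized by the offset between tokens rather than their absolute index, so the parameters learned at length 128 are reusable at longer lengths.

```python
# Illustrative sketch: with relative positions, attention indexes positional
# parameters by the (clipped) offset i - j, not by absolute position. The
# offset matrix for a longer sequence contains the shorter one as its
# top-left block, so no new positional parameters are needed at inference.

def relative_offsets(seq_len: int, max_offset: int = 127):
    """Matrix of clipped relative offsets used to index positional parameters."""
    return [
        [max(-max_offset, min(max_offset, i - j)) for j in range(seq_len)]
        for i in range(seq_len)
    ]

short = relative_offsets(4)
longer = relative_offsets(6)

# The top-left 4x4 block of the longer matrix equals the shorter matrix:
assert all(longer[i][:4] == short[i] for i in range(4))
```

The remaining risk is exactly the one noted above: the model may never have seen the chemistry of very long SMILES during training, even though the positional machinery itself extrapolates.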
from molbert.