
molpal's People

Contributors

connorcoley, davidegraff, degraff-roivant, sunhwan


molpal's Issues

bug in fingerprints.py

Hi,

Nice work!

I think I found a bug in fingerprints.py.

On line 222, the parser defines "parser.add_argument('-l', '--libraries', required=True, nargs='+', ...)",

but later, on line 235, the code reads "path = args.path or Path(args.library).parent".

libraries != library, so args.library is never defined.
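A minimal sketch of the mismatch and one possible fix (attribute names are taken from the two snippets quoted above; everything else in fingerprints.py is assumed):

```python
from argparse import ArgumentParser
from pathlib import Path

parser = ArgumentParser()
parser.add_argument('-l', '--libraries', required=True, nargs='+')
parser.add_argument('--path', default=None)
args = parser.parse_args(['--libraries', 'libs/Enamine10k.csv.gz', 'libs/other.csv.gz'])

# args.library was never defined, so this line would raise AttributeError:
#   path = args.path or Path(args.library).parent
# One possible fix: derive the path from the first entry of the plural argument.
path = args.path or Path(args.libraries[0]).parent
print(path)  # libs
```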

what's your take on random sampling?

Hi,

In MolPAL you use Bayesian optimization to decide which compounds to sample.

However, some other approaches use plain random sampling and still achieve good results. For instance:

  1. Gentile F, Agrawal V, Hsing M, et al. Deep docking: a deep learning platform for augmentation of structure based drug discovery. ACS Central Science, 2020, 6(6): 939-949.

  2. Martin L. State of the art iterative docking with logistic regression and Morgan fingerprints. 2021.

Could you please comment on that? Is BO necessary?

[QUESTION]: Small dataset for training

What are you trying to do?
My training data is small (19 observations) and my lookup pool is only 1500 structures. I am wondering how I should set up the various parameters in the config file to account for this.

docking for objectives

Great work!
Currently I have the following problem. In the Enamine50k example, I changed the objective type from lookup to docking in Enamine50k_retrain.ini, and I also updated the docking.ini file (4unn, center = [6.69, 17.69, -7.07], size = [40, 40, 40]). However, the docking does not seem to work properly. I saw someone ask a similar question, but I don't know how to deal with it. Thank you!

(prepare_and_run pid=201507) Information (optional):
(prepare_and_run pid=201507) --help display usage summary
(prepare_and_run pid=201507) --help_advanced display usage summary with advanced options
(prepare_and_run pid=201507) --version display program version
(prepare_and_run pid=201507)
(prepare_and_run pid=201507)
Docking: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 503/503 [00:08<00:00, 59.15ligand/s]
Exception raised! Intemediate state saved to "molpal_50k/chkpts/iter_0_2021-12-21_18-02-39/state.json"
Traceback (most recent call last):
File "run.py", line 71, in
main()
File "run.py", line 55, in main
explorer.run()
File "/home/fgq/molpal-main/molpal/explorer.py", line 317, in run
self.explore_initial()
File "/home/fgq/molpal-main/molpal/explorer.py", line 363, in explore_initial
self.write_scores(include_failed=True)
File "/home/fgq/molpal-main/molpal/explorer.py", line 542, in write_scores
top_m = self.top_explored(m)
File "/home/fgq/molpal-main/molpal/explorer.py", line 473, in top_explored
if k / len(self.scores) < 0.8:
ZeroDivisionError: division by zero
(prepare_and_run pid=201492) ERROR: docking failed. Message: Command line parse error: unrecognised option '--log=/tmp/pyscreener/session_2021-12-21_18-02-29/outputs/vina_4unn_receptor_ligand_493_0.log'
(prepare_and_run pid=201492)
(prepare_and_run pid=201492) Correct usage:
(prepare_and_run pid=201492)
(prepare_and_run pid=201492) Input:
(prepare_and_run pid=201492) --receptor arg rigid part of the receptor (PDBQT)
(prepare_and_run pid=201492) --flex arg flexible side chains, if any (PDBQT)
(prepare_and_run pid=201492) --ligand arg ligand (PDBQT)
(prepare_and_run pid=201492) --batch arg batch ligand (PDBQT)
(prepare_and_run pid=201492) --scoring arg (=vina) scoring function (ad4, vina or vinardo)
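For reference, the ZeroDivisionError above comes from dividing by len(self.scores) when no ligand was scored successfully (every docking call failed on the unrecognised '--log' option). A hedged sketch of a defensive guard, with a hypothetical function name; the real fix is resolving the Vina incompatibility so that scores are actually produced:

```python
def top_explored_fraction(k, scores):
    """Fraction of the explored pool that a top-k request represents.

    Guards against an empty score dictionary, which happens when every
    docking call failed (as in the traceback above).
    """
    if not scores:
        raise RuntimeError(
            "no ligands were scored successfully; "
            "check the docking backend (e.g., the unrecognised '--log' option)"
        )
    return k / len(scores)

print(top_explored_fraction(10, {"c1ccccc1": -6.2}))  # 10.0
```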

[BUG]:

Describe the bug
I am currently unable to reproduce the top-k SMILES metric MPN model results from the MolPAL paper using the code in this GitHub repo.

Tables S1-S3 list the following percentages: 59.6 +/- 2.3 (10k library) and 68.8 +/- 1.0 (50k library). My current runs repeatedly yield worse metrics: ~44 +/- 3.7 (10k library) and 64.8 +/- 1.7 (50k library). I haven't been able to run the HTS library yet due to excessive compute time. However, these metrics differ noticeably from the manuscript.

I have confirmed that the molpal library/my conda environment reproduces the reported RF and NN results. This makes me suspect that something in the current molpal MPN code is driving the issue.

Example(s)
I have created shell scripts to run molpal using the MPN model over five runs, as performed in the paper. An example is shown below for the 10k library. An analogous script launches the runs using the 50k library.

rm -r ./molpal_10k/
rm -r ./our_runs/
mkdir our_runs

python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
mv molpal_10k/ our_runs/run1/

rm -r ./molpal_10k/
python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
mv molpal_10k/ our_runs/run2/

rm -r ./molpal_10k/
python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
mv molpal_10k/ our_runs/run3/

rm -r ./molpal_10k/
python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
mv molpal_10k/ our_runs/run4/

rm -r ./molpal_10k/
python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
mv molpal_10k/ our_runs/run5/

I am currently parsing the results using the fraction of top-k SMILES identified metric. Below is the script I am using to parse results.

import pandas as pd
import numpy as np

def compare(df1, df2):
    # df1: ligands explored by one molpal run; df2: the full library, sorted by score
    s1, s2 = list(df1.smiles), list(df2.smiles)[:100]
    dups = list(set(s1) & set(s2))
    print('number of top scoring compounds found:', len(dups))
    print('total number of ligands explored:', len(s1), len(set(s1)))
    print('number of unique ligands recorded in top:', len(set(s2)))
    print(' ')
    print(' ')
    return len(dups)

dfm = pd.read_csv('./data/Enamine10k_scores.csv.gz', index_col=None, compression='gzip')

# the shell script above writes run1 through run5
files = ['./our_runs/run1/data/top_636_explored_iter_5.csv',
         './our_runs/run2/data/top_636_explored_iter_5.csv',
         './our_runs/run3/data/top_636_explored_iter_5.csv',
         './our_runs/run4/data/top_636_explored_iter_5.csv',
         './our_runs/run5/data/top_636_explored_iter_5.csv']

num = []
for ff in files:
    df1 = pd.read_csv(ff, index_col=None)
    num += [compare(df1, dfm) / 100.]

num = np.array(num)
print("statistics:", np.mean(num), np.std(num))

Here are some example outputs I have obtained

number of top scoring compounds found: 39
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100
 
number of top scoring compounds found: 45
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100
 
number of top scoring compounds found: 43
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100
 
number of top scoring compounds found: 50
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100
 
number of top scoring compounds found: 45
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100
 
statistics: 0.44400000000000006 0.035552777669262355

I see a similar drop in performance for the 50k library, although the results are closer to the stats reported in the paper.

number of top scoring compounds found: 334
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 323
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 325
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 309
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 328
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
statistics: 0.6476 0.016560193235587575

As mentioned, the code currently reproduces the RF and NN results. For example, I show the output from the 50k library when using the NN model.

number of top scoring compounds found: 347
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 341
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 346
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 361
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 352
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
statistics: 0.6988 0.01354104870384859

This matches the 70% reported in the paper almost exactly, whereas the MPN model is off by a statistically significant amount.

Expected behavior
My current understanding of the Molpal library is that the MPN runs should reproduce the statistics reported in tables S1-S3 of the paper.


Environment

  • python version - python 3.8.13
  • package versions:
Package                      Version
---------------------------- ---------
absl-py                      1.1.0
aiohttp                      3.8.1
aiohttp-cors                 0.7.0
aiosignal                    1.2.0
astunparse                   1.6.3
async-timeout                4.0.2
attrs                        21.4.0
blessed                      1.19.1
cachetools                   5.2.0
certifi                      2022.6.15
charset-normalizer           2.0.12
click                        8.0.4
colorama                     0.4.4
colorful                     0.5.4
ConfigArgParse               1.5.3
cycler                       0.11.0
distlib                      0.3.4
filelock                     3.7.1
flatbuffers                  1.12
fonttools                    4.33.3
frozenlist                   1.3.0
fsspec                       2022.5.0
gast                         0.4.0
google-api-core              2.8.1
google-auth                  2.7.0
google-auth-oauthlib         0.4.6
google-pasta                 0.2.0
googleapis-common-protos     1.56.2
gpustat                      1.0.0b1
greenlet                     1.1.2
grpcio                       1.43.0
h5py                         3.7.0
idna                         3.3
importlib-metadata           4.11.4
importlib-resources          5.7.1
joblib                       1.1.0
jsonschema                   4.6.0
keras                        2.9.0
Keras-Preprocessing          1.1.2
kiwisolver                   1.4.2
libclang                     14.0.1
Markdown                     3.3.7
matplotlib                   3.5.2
msgpack                      1.0.4
multidict                    6.0.2
munkres                      1.1.4
numpy                        1.22.4
nvidia-ml-py3                7.352.0
oauthlib                     3.2.0
opencensus                   0.9.0
opencensus-context           0.1.2
OpenMM                       7.7.0
opt-einsum                   3.3.0
packaging                    21.3
pandas                       1.4.2
pdbfixer                     1.8.1
Pillow                       9.1.1
pip                          22.1.2
platformdirs                 2.5.2
prometheus-client            0.13.1
protobuf                     3.19.4
psutil                       5.9.1
py-spy                       0.3.12
pyasn1                       0.4.8
pyasn1-modules               0.2.8
pycairo                      1.21.0
pyDeprecate                  0.3.2
pyparsing                    3.0.9
pyrsistent                   0.18.1
pyscreener                   1.2.2
python-dateutil              2.8.2
pytorch-lightning            1.6.4
pytz                         2022.1
PyYAML                       6.0
ray                          1.13.0
reportlab                    3.5.68
requests                     2.28.0
requests-oauthlib            1.3.1
rsa                          4.8
scikit-learn                 1.1.1
scipy                        1.8.1
setuptools                   62.3.3
six                          1.16.0
smart-open                   6.0.0
SQLAlchemy                   1.4.37
tabulate                     0.8.9
tensorboard                  2.9.1
tensorboard-data-server      0.6.1
tensorboard-plugin-wit       1.8.1
tensorboardX                 2.5.1
tensorflow                   2.9.1
tensorflow-addons            0.17.1
tensorflow-estimator         2.9.0
tensorflow-io-gcs-filesystem 0.26.0
termcolor                    1.1.0
threadpoolctl                3.1.0
torch                        1.11.0
torchmetrics                 0.9.1
tqdm                         4.64.0
typeguard                    2.13.3
typing_extensions            4.2.0
unicodedata2                 14.0.0
urllib3                      1.26.9
virtualenv                   20.14.1
wcwidth                      0.2.5
Werkzeug                     2.1.2
wheel                        0.37.1
wrapt                        1.14.1
yarl                         1.7.2
zipp                         3.8.0
  • OS - linux

Checklist

  • all dependencies are satisfied: conda list shows the packages listed in the README
    I believe so.
  • the unit tests are working: pytest -v reports no errors
    I do see some errors here, but they don't appear relevant. I can import rdkit without an issue.
_____________________________________________________________________________________________________________ ERROR collecting tests/test_acquirer.py ______________________________________________________________________________________________________________
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_acquirer.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_acquirer.py:8: in <module>
    from molpal.acquirer import Acquirer
molpal/__init__.py:1: in <module>
    from .explorer import Explorer
molpal/explorer.py:13: in <module>
    from molpal import acquirer, featurizer, models, objectives, pools
molpal/featurizer.py:10: in <module>
    import rdkit.Chem.rdMolDescriptors as rdmd
E   ModuleNotFoundError: No module named 'rdkit'
____________________________________________________________________________________________________________ ERROR collecting tests/test_featurizer.py _____________________________________________________________________________________________________________
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_featurizer.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_featurizer.py:6: in <module>
    from rdkit import Chem
E   ModuleNotFoundError: No module named 'rdkit'
______________________________________________________________________________________________________________ ERROR collecting tests/test_lookup.py _______________________________________________________________________________________________________________
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_lookup.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_lookup.py:10: in <module>
    from molpal.objectives.lookup import LookupObjective
molpal/__init__.py:1: in <module>
    from .explorer import Explorer
molpal/explorer.py:13: in <module>
    from molpal import acquirer, featurizer, models, objectives, pools
molpal/featurizer.py:10: in <module>
    import rdkit.Chem.rdMolDescriptors as rdmd
E   ModuleNotFoundError: No module named 'rdkit'
_______________________________________________________________________________________________________________ ERROR collecting tests/test_pool.py ________________________________________________________________________________________________________________
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_pool.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_pool.py:4: in <module>
    from molpal.pools.base import MoleculePool
molpal/__init__.py:1: in <module>
    from .explorer import Explorer
molpal/explorer.py:13: in <module>
    from molpal import acquirer, featurizer, models, objectives, pools
molpal/featurizer.py:10: in <module>
    import rdkit.Chem.rdMolDescriptors as rdmd
E   ModuleNotFoundError: No module named 'rdkit'


[QUESTION]: Are docking scores for the 2.1 million member HTS Collection (“Enamine HTS”) against PDB 4UNN available?

What are you trying to do?
Requesting some data from the MolPAL publication.

Hi MolPAL developers,

I was wondering: are the docking scores for the 2.1 million member HTS Collection (“Enamine HTS”), docked with AutoDock Vina against PDB 4UNN, available from your paper?
I was hoping to reproduce the results in an adaptation of this active learning approach, and it would be really helpful to have these scores to compute the metrics displayed in Figures 1, 2 & 3 of the MolPAL paper.

Thank you.

Best regards,
Joshua Soon

Suggest_next_molecule

Dear developers,
Great work.
Do you have a function to suggest the next molecule to test based on the surrogate model that is created? I mean not scoring an existing list of potential molecules (your "library"), but generating the fingerprint of the best next molecule to test according to the acquisition function?

For example, the "suggest_next_locations" function in GPyOpt, a similar library for Bayesian optimization.

Thanks, Lionel
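For context, MolPAL's acquisition is pool-based: it ranks a finite candidate library rather than generating new fingerprints. A minimal sketch of that pattern (the surrogate and acquisition functions here are illustrative stand-ins, not MolPAL's API):

```python
def suggest_next(pool, explored, surrogate, acquisition):
    """Pick the unexplored pool member that maximizes the acquisition utility."""
    candidates = [x for x in pool if x not in explored]
    return max(candidates, key=lambda x: acquisition(*surrogate(x)))

# Stand-in surrogate: returns (mean, variance) for a candidate.
surrogate = lambda x: (0.1 * x, 1.0)
# Upper confidence bound with beta = 2.
ucb = lambda mean, var: mean + 2 * var ** 0.5

pool = list(range(100))
print(suggest_next(pool, explored={99}, surrogate=surrogate, acquisition=ucb))  # 98
```

Generating a brand-new fingerprint (as GPyOpt does for continuous inputs) would require inverting the fingerprint back to a molecule, which is a separate, hard problem.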

[QUESTION]: Recommended parameters for molecule screen

What are you trying to do?
Hi, thanks for your work on molpal and pyscreener! I wish to perform a screen over the 13M clean drug-like molecules in ZINC12, and I was wondering if you could recommend the optimal parameters. For a library this size, should I use the cluster flag? I'm looking to get the top 100k hits; what k, window and delta should I set? Please see the config printout from my current setup. The abstract of the paper says "we can identify 94.8% or 89.3% of the top-50 000 ligands in a 100M member library after testing only 2.4%", so I set the budget to 2% of my library size (260,000). I would also like the top hits to include molecules from multiple local maxima, and not just all the hits from the global maximum; which parameters encourage exploration over exploitation?

Thanks for your time!

MolPAL will be run with the following arguments:
  batch_sizes: [100]
  budget: 260000
  cache: False
  checkpoint_file: None
  chkpt_freq: 0
  cluster: False
  config: None
  cxsmiles: False
  ddp: False
  delimiter: ,
  delta: 0.01
  epsilon: 0.0
  fingerprint: pair
  fps: libraries/zinc_13m.h5
  init_size: 100
  invalid_idxs: []
  k: 100000
  length: 2048
  libraries: ['libraries/zinc_13m.txt']
  max_depth: 8
  max_iters: 10000
  metric: greedy
  min_samples_leaf: 1
  minimize: True
  model: rf
  model_seed: None
  n_estimators: 100
  ncpu: 1
  objective: docking
  objective_config: config/murd.ini
  output_dir: zinc_13m
  pool: eager
  precision: 32
  previous_scores: None
  radius: 2
  retrain_from_scratch: True
  scores_csvs: None
  seed: None
  smiles_col: 0
  test_batch_size: None
  title_line: True
  verbose: 0
  window_size: 10
  write_final: True
  write_intermediate: True
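On the exploration question, the knobs visible in the printout above are the acquisition metric, epsilon, and cluster. A hedged sketch of the relevant config fragment (values are illustrative guesses, not a recommendation from the authors):

```ini
; uncertainty-aware metrics (e.g. ucb) trade exploration against exploitation;
; greedy is purely exploitative (metric names assumed from the MolPAL options)
metric = ucb
; with greedy, epsilon > 0 instead mixes in that fraction of random acquisitions
epsilon = 0.1
; cluster-based acquisition can spread picks across chemotypes
cluster = True
```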

[QUESTION]: How can I build a larger dataset than Enamine HTS?

I want to build a larger dataset than Enamine HTS.

I got the receptor from PDBFixer, and I tried to get the scores using the following code:
image
and I got the error:
image

I used the parameters as the article describes. Can you tell me how to get the same scores as for Enamine HTS, or just share the code and receptor files you used?

how to handle tautomers?

Hi,

Multiple tautomers can be generated during ligand preparation. How should I deal with these tautomers? Should I add them all to the pool library?

Tautomers normally have different fingerprints.
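One common approach (my suggestion, not something MolPAL prescribes) is to collapse each ligand to a single canonical tautomer before building the pool, so tautomers of the same compound share one fingerprint. A minimal sketch using RDKit's tautomer enumerator:

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

enumerator = rdMolStandardize.TautomerEnumerator()

def canonical_tautomer_smiles(smiles: str) -> str:
    """Map any tautomer of a compound to one canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(enumerator.Canonicalize(mol))

# Two tautomers of 2-pyridone collapse to the same canonical form:
print(canonical_tautomer_smiles("O=c1cccc[nH]1") ==
      canonical_tautomer_smiles("Oc1ccccn1"))  # True
```

The prepared 3D tautomers can then still be docked individually, with the best score assigned back to the canonical pool entry.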

Fingerprints not generating

Hello,

I am trying to generate fingerprints as you show in your documentation. Here is what I have done so far:

1) git clone https://github.com/coleygroup/molpal.git

2) conda env create -f environment.yml
do this in the directory inside molpal which contains the environment.yml

3) conda activate molpal

4) to test for getting rdkit into jupyter notebook,
conda create --name activelearning rdkit
conda activate activelearning
python -m ipykernel install --user --name=activelearning
jupyter notebook
and then create a notebook with the activelearning conda environment

5) start a ray cluster
redis_password=$( uuidgen 2> /dev/null )
export redis_password
ray start --head --redis-password=$redis_password --num-cpus 4 --num-gpus 1
export redis_password
export ip_head=localhost:6379

6) generate fingerprints
python fingerprints.py --library molpal/libraries/Enamine10k.csv.gz --fingerprint pair --length 2048 --radius 2 --name test_library

When I run this last command, it appears to hang up:

(molpal) cseitz@arizona:/net/gpfs-amarolab/cseitz/from_jam/projects/activelearning$ python fingerprints.py --library molpal/libraries/Enamine10k.csv.gz --fingerprint pair --length 2048 --radius 2 --name test_library
2022-01-04 19:46:31,252	INFO worker.py:826 -- Connecting to existing Ray cluster at address: 132.239.174.179:8899
Namespace(delimiter=',', fingerprint='pair', length=2048, library='molpal/libraries/Enamine10k.csv.gz', name='test_library', no_title_line=False, path='.', radius=2, smiles_col=0, title_line=True, total_size=None)

No results get generated after ~36 hours, and the command prompt does not return to the ready position. Looking at the currently running processes, it is not apparent that any of them are associated with fingerprint generation. Do you have any ideas on what I may be doing wrong? Thanks!

Best,
Christian

Y_pred.npy file

Hi
I was using molpal for a retrospective docking study. The objective configuration is to look up the already-known docking scores.
image
I am trying to understand the output files. I found that the Y_pred.npy file is a numpy array of floating point numbers whose size is the same as my molecular library. Are these numbers values reflecting how 'good' the corresponding compounds are, so that molpal will select them for the next iteration of exploration? Or are they simply the docking scores predicted by the RF regression model?
And does the order of these numbers follow the order of the compounds in the library file?

Below is my config file:
image

Many thanks
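Assuming the predictions are written in pool order (worth confirming with the developers; the assumption is mine, as are the stand-in data below, which replace np.load("Y_pred.npy") and the library CSV), pairing the array with the library looks like:

```python
import numpy as np

# Synthetic stand-ins: in a real run, smis comes from the library file and
# y_pred from np.load("Y_pred.npy").
smis = ["CCO", "c1ccccc1", "CC(=O)O"]
y_pred = np.array([-5.1, -7.3, -4.2])

assert len(y_pred) == len(smis)  # sizes must match if order is preserved
# rank the library by predicted score (most negative = best for docking)
ranked = sorted(zip(smis, y_pred), key=lambda p: p[1])
print(ranked[0][0])  # c1ccccc1
```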

[BUG]: Not compatible with the latest Ray/Pytorch-Lightning versions

Describe the bug
If the package is installed by following the instructions in README.md, the main CLI script fails to run due to incompatibilities with the ray and pytorch-lightning versions current as of Apr. 2023. I suspect there have been major version changes in both Ray and PyTorch Lightning.

I fixed this by pinning pytorch-lightning and ray narrowly in the environment.yml file.

name: molpal

channels:
  - nvidia
  - pytorch
  - conda-forge
  - defaults

dependencies:
  - python=3.9
  - pytorch=1.13.1
  - pytorch-cuda=11.7
  - pip
  - pip:
    - configargparse
    - h5py
    - numpy
    - ray >= 1.11,<2.0
    - ray[tune]
    - rdkit
    - pytorch-lightning == 1.5.10
    - scikit-learn
    - tensorflow
    - tensorflow-addons
    - tqdm

standalone code?

Hi,

I know that you specifically mention the package is currently set up to run with the docking program you have specified -- but I have built a pipeline around a different docking program that works for my current model.

I was wondering if you would be willing to suggest possible ways to use the existing code base to build a model with a dataset that I have in house (it isn't proprietary, so I can share it).

In essence, I have about ~15M docked compounds in the format:
image

I have tried taking snippets of your code to see if I can get a basic sklearn RF regression model going, but I am struggling a bit. Essentially, I would like to use your encoder/sklearn modules on parts of my dataset to see if I can reproduce some of the trends you observe in the paper.

The batch of 15M was picked at random from the Enamine REAL collection (in 15 chunks) and docked using 15 iterations of our docking pipeline.

I am happy to share the docking pipeline codebase (it uses AutoDock-GPU and runs end-to-end from SMILES to docking scores, which are deposited in JSON objects). We can do about 15M ligands in a few days on 35 GPUs and 1000 CPU cores with SGE (it is easy to adapt to Slurm or another engine). My strength lies more in pipeline engineering than ML; I am quite green with respect to the latter.

Feel free to close this if you feel it is outside the scope of issues.

My email is thomas.graham at pennmedicine.upenn.edu.

If that is a better way to chat, let me know. Thanks for being the only group to release an excellent code base and superb documentation.
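The basic pattern being asked about is just: featurize SMILES into fixed-length vectors, then fit a scikit-learn RandomForestRegressor on (features, docking score) pairs. MolPAL's featurizer module computes Morgan/atom-pair fingerprints for that step; the sketch below (mine, with illustrative data and a stand-in hash featurizer to avoid an RDKit dependency) shows the shape of the workflow:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def featurize(smiles: str, length: int = 64) -> np.ndarray:
    """Stand-in featurizer: hashed character bigrams -> fixed-length bit vector.
    In practice, use a Morgan fingerprint (e.g., via RDKit), as molpal does."""
    x = np.zeros(length)
    for a, b in zip(smiles, smiles[1:]):
        x[hash(a + b) % length] = 1.0
    return x

smis = ["CCO", "CCC", "c1ccccc1", "CC(=O)O", "CCN"]
scores = [-5.1, -4.9, -7.3, -4.2, -5.0]  # illustrative docking scores

X = np.stack([featurize(s) for s in smis])
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, scores)
preds = model.predict(X)
print(preds.shape)  # (5,)
```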

unable to reproduce results

Hi, I am trying to reproduce your Enamine 50k results, so far unsuccessfully. I was using the provided Enamine50k_online.ini config file in the examples/config folder. No settings were changed except the output-dir name; I chose a different name.
image

These are the three things generated in the directory after the run is completed.
image

in the data folder, there are:
image

I tried to analyze these explored compounds to calculate how well they recover the top-500 scored compounds. They only achieve 'random' performance. Is there anything I missed?
image

Thanks

[QUESTION]: training dataloader is currently defined with shuffle=False. Is this intentional?

Hello,

I have really enjoyed going through the molpal codebase.

While studying the code, I noticed that the default value of shuffle in the MoleculeDataLoader class is False. Currently, the train_dataloader used for training the MPN model in the train function of the MPN class in mpnmodels.py (lines 177-179) is created with shuffle=False.

train_dataloader = MoleculeDataLoader(
            train_data, self.batch_size, self.ncpu, pin_memory=False
        )

I am wondering if this is intentional? Training without shuffling can hurt convergence and generalization. The code below would create the dataloader with shuffle=True.

train_dataloader = MoleculeDataLoader(
         train_data, self.batch_size, self.ncpu, shuffle=True, pin_memory=False
     )

[QUESTION]: Basic MolPal usage

First of all, I would like to thank you for your work with MolPal, I'm sure this will be very useful to the drug discovery community in the future.

I am trying to use your software to predict docking scores that have been generated by my own consensus docking pipeline.

I am struggling to understand (despite your very clear documentation) how I would go about this. If I understand correctly, I would be using the lookup function of MolPAL. However, when I look at your examples on GitHub (for example /examples/objective/EnamineHTS_lookup.ini, which points to /data/EnamineHTS_scores.csv.gz), it seems the software looks up docking scores for the entire library. Obviously I am misunderstanding something here, as the point is to dock only a small part of the library.

These are the two ways I understand this, feel free to let me know if I'm completely off with either of these...

  1. Assuming I dock 1% of the library to start with, I would then essentially train the model on that 1%, do one iteration of Molpal, then dock the compounds predicted to have good docking scores. Then repeat this process until I reach a total of 6 iterations as described in the paper?
  2. Alternatively, assuming I dock 1% of the library to start with, would I then run Molpal directly using this 1% as a lookup and not have to dock the suggested compounds (except if I wanted to check the performance of the prediction?). In this case, I don't fully grasp why there would be an option to 'retrain from scratch'.

I apologize if I am missing something obvious.

Thank you for your time and assistance.
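For what it's worth, the lookup objective is meant for retrospective benchmarking: the full score CSV plays the role of an oracle, but only the scores of compounds the explorer chooses to acquire are ever revealed to the model. A stdlib sketch of that idea (file contents and names are illustrative):

```python
import csv
import io

# Illustrative pre-computed scores, standing in for EnamineHTS_scores.csv.gz.
csv_text = "smiles,score\nCCO,-5.1\nCCC,-4.9\nc1ccccc1,-7.3\n"
scores = {row["smiles"]: float(row["score"])
          for row in csv.DictReader(io.StringIO(csv_text))}

def lookup_objective(acquired):
    """Reveal scores only for the acquired batch, mimicking a docking call."""
    return {smi: scores[smi] for smi in acquired}

# Each iteration "docks" (looks up) only what was acquired:
print(lookup_objective(["CCO"]))  # {'CCO': -5.1}
```

In a prospective run you would replace the lookup with real docking of each acquired batch, which matches your option 1.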

OOM Errors when running EnamineHTS_single_batch.ini on a machine with RTX3090

Thanks for great work!

I'm trying to reproduce the experimental results on Enamine HTS but was unable to resolve the CUDA OOM error in the inference step.

I have tried the methods in #47, but the error is still there.

Here is my screenshot of nvidia-smi:
image

and here are tracebacks of the error:
image

I have no idea on how to deal with this.

Multi-task MPNN target sizes do not match

Hi! Thanks for a wonderful repository.

I am trying to train a multitask regression using your MPNN class:

model = MPNN(ncpu=12, num_tasks=2)
I am testing it with a target numpy array of shape (10000, 2)

When I run model.train(smis, targets) I get the warning:

Sanity check ... /home/mduranfrigola/github/ersilia-os/zaira-chem-lite/zairachemlite/model/molpal/models/mpnn/ptl/model.py:36: UserWarning: Using a target size (torch.Size([50, 1, 2])) that is different to the input size (torch.Size([50, 2])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  "rmse": lambda X, Y: torch.sqrt(F.mse_loss(X, Y, reduction="none")),
Training:   0%|                                                                                                  | 0/50 [00:00<?, ?epoch/s/home/mduranfrigola/miniconda3/envs/molpal/lib/python3.8/site-packages/torch/nn/modules/loss.py:520: UserWarning: Using a target size (torch.Size([50, 1, 2])) that is different to the input size (torch.Size([50, 2])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.

and then the following error, correspondingly:

/home/mduranfrigola/github/ersilia-os/zaira-chem-lite/zairachemlite/model/molpal/models/mpnn/ptl/model.py:36: UserWarning: Using a target size (torch.Size([1, 1, 2])) that is different to the input size (torch.Size([1, 2])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  "rmse": lambda X, Y: torch.sqrt(F.mse_loss(X, Y, reduction="none")),
Traceback (most recent call last):
  File "__init__.py", line 33, in <module>
    mdl.train(smiles, targets)
  File "/home/mduranfrigola/github/ersilia-os/zaira-chem-lite/zairachemlite/model/molpal/models/mpnmodels.py", line 181, in train
    trainer.fit(lit_model, train_dataloader, val_dataloader)
  File "/home/mduranfrigola/miniconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/home/mduranfrigola/miniconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/home/mduranfrigola/miniconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/home/mduranfrigola/miniconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/mduranfrigola/miniconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/mduranfrigola/miniconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/mduranfrigola/miniconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
    self.train_loop.run_training_epoch()
  File "/home/mduranfrigola/miniconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 576, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
  File "/home/mduranfrigola/miniconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 988, in run_evaluation
    self.evaluation_loop.evaluation_epoch_end(outputs)
  File "/home/mduranfrigola/miniconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 213, in evaluation_epoch_end
    model.validation_epoch_end(outputs)
  File "/home/mduranfrigola/github/ersilia-os/zaira-chem-lite/zairachemlite/model/molpal/models/mpnn/ptl/model.py", line 88, in validation_epoch_end
    val_loss = torch.cat(outputs).mean()
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 50 but got size 1 for tensor number 40 in the list.
Exception ignored in: <function tqdm.__del__ at 0x7fa07b9a1790>
Traceback (most recent call last):
  File "/home/mduranfrigola/.local/lib/python3.8/site-packages/tqdm/std.py", line 1124, in __del__
  File "/home/mduranfrigola/.local/lib/python3.8/site-packages/tqdm/std.py", line 1337, in close
  File "/home/mduranfrigola/.local/lib/python3.8/site-packages/tqdm/std.py", line 1516, in display
  File "/home/mduranfrigola/.local/lib/python3.8/site-packages/tqdm/std.py", line 1127, in __repr__
  File "/home/mduranfrigola/.local/lib/python3.8/site-packages/tqdm/std.py", line 1477, in format_dict
TypeError: cannot unpack non-iterable NoneType object

Is there anything I am doing wrong?

Many thanks!
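For what it's worth, the broadcasting warning can be reproduced in isolation with the (50, 1, 2) vs (50, 2) shapes from the log above. Dropping the stray middle dimension with `squeeze(1)` is one possible fix, under the assumption that the extra dimension comes from the targets being unsqueezed somewhere upstream:

```python
import torch
import torch.nn.functional as F

preds = torch.randn(50, 2)       # model output: (batch, num_tasks)
targets = torch.randn(50, 1, 2)  # targets carrying a stray middle dimension

# broadcasting silently expands (50, 2) against (50, 1, 2) -> (50, 50, 2),
# which is what later breaks the epoch-end torch.cat over batch outputs
loss_bad = F.mse_loss(preds, targets, reduction="none")

# dropping the stray dimension restores a true elementwise loss
loss_ok = F.mse_loss(preds, targets.squeeze(1), reduction="none")
```

So the targets most likely need to reach the loss with shape `(batch, num_tasks)`, matching the prediction tensor exactly.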

[QUESTION]:

Hello,

I am using molpal with docking and it works well. I am using a subset of the eMolecules library (~1M compounds) to do a VS against a target, with --init-size 0.005 and --batch-size 0.005.

The only thing I am not able to control is the number of CPUs the Vina docking uses. I am using ray start --head, and my VM has 24 CPUs plus GPUs. MolPAL only uses 8 out of the 24 CPUs.

This problem does not occur when I use pyscreener directly, which utilizes all the CPUs. Can you please suggest which parameter I should change to make the docking run on all 24 CPUs?

Thanks
Sandeep
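One thing worth checking (an assumption on my part, not a confirmed molpal behavior): if the Ray head node was started without an explicit CPU count, or an older Ray instance is still running, the cluster may be advertising fewer CPUs than the VM has. Restarting the head node with the count spelled out looks like:

```shell
# stop any previously started Ray instance, then restart the head node
# advertising all 24 CPUs explicitly
ray stop
ray start --head --num-cpus=24
```

`ray status` can then be used to confirm how many CPUs the cluster actually registers before launching molpal.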

MPN model error

Hello! Thanks for the great work.

I am trying to run a lookup objective using the MPN model.

However, I get the following error:

  File "run.py", line 71, in <module>
    main()
  File "run.py", line 55, in main
    explorer.run()
  File "/home/njgoo/Data1/program/molpal/molpal/explorer.py", line 326, in run
    self.explore_batch()
  File "/home/njgoo/Data1/program/molpal/molpal/explorer.py", line 396, in explore_batch
    self._update_model()
  File "/home/njgoo/Data1/program/molpal/molpal/explorer.py", line 729, in _update_model
    self.model.train(
  File "/home/njgoo/Data1/program/molpal/molpal/models/mpnmodels.py", line 396, in train
    return self.model.train(xs, ys)
  File "/home/njgoo/Data1/program/molpal/molpal/models/mpnmodels.py", line 181, in train
    trainer.fit(lit_model, train_dataloader, val_dataloader)
  File "/home/njgoo/Data1/program/anaconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in fit
    self._call_and_handle_interrupt(
  File "/home/njgoo/Data1/program/anaconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/njgoo/Data1/program/anaconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/njgoo/Data1/program/anaconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1193, in _run
    self._dispatch()
  File "/home/njgoo/Data1/program/anaconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1272, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/njgoo/Data1/program/anaconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/njgoo/Data1/program/anaconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1282, in run_stage
    return self._run_train()
  File "/home/njgoo/Data1/program/anaconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1312, in _run_train
    self.fit_loop.run()
  File "/home/njgoo/Data1/program/anaconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/njgoo/Data1/program/anaconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/home/njgoo/Data1/program/anaconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/njgoo/Data1/program/anaconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 220, in advance
    self.trainer.call_hook("on_train_batch_end", batch_end_outputs, batch, batch_idx, **extra_kwargs)
  File "/home/njgoo/Data1/program/anaconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1477, in call_hook
    callback_fx(*args, **kwargs)
  File "/home/njgoo/Data1/program/anaconda3/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 179, in on_train_batch_end
    callback.on_train_batch_end(self, self.lightning_module, outputs, batch, batch_idx, 0)
  File "/home/njgoo/Data1/program/molpal/molpal/models/mpnn/ptl/callbacks.py", line 41, in on_train_batch_end
    super().on_train_batch_end(
TypeError: on_train_batch_end() takes 6 positional arguments but 7 were given

How do I fix this problem?

Thank you!!
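For context, this error usually means the installed pytorch-lightning version expects a different `on_train_batch_end` signature than the one molpal's callback forwards (newer releases dropped a trailing `dataloader_idx` argument). A library-free sketch of the mismatch and a version-tolerant override (all class names here are illustrative, not molpal's actual classes):

```python
class BaseCallback:
    # stand-in for the pytorch-lightning base class (illustrative only)
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        return "ok"

class TolerantCallback(BaseCallback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, *extra):
        # *extra swallows positional arguments that other pytorch-lightning
        # versions add (e.g. a trailing dataloader_idx) instead of raising
        # "takes 6 positional arguments but 7 were given"
        return super().on_train_batch_end(trainer, pl_module, outputs, batch, batch_idx)

# an old-style call site passing the extra index still works:
result = TolerantCallback().on_train_batch_end(None, None, None, None, 0, 0)
```

Alternatively, pinning pytorch-lightning to the version listed in molpal's requirements should avoid the mismatch entirely.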

molpal crash due to invalid SMILES

Basically, RDKit MMFF can't generate a conformation for some SMILES, which causes molpal to crash.

########################################################################
File "/work/home/aixplorerbio_wz/ylk/molpal-main/run.py", line 71, in
main()
File "/work/home/aixplorerbio_wz/ylk/molpal-main/run.py", line 55, in main
explorer.run()
File "/work/home/aixplorerbio_wz/ylk/molpal-main/molpal/explorer.py", line 317, in run
self.explore_initial()
File "/work/home/aixplorerbio_wz/ylk/molpal-main/molpal/explorer.py", line 355, in explore_initial
new_scores = self.objective(inputs)
File "/work/home/aixplorerbio_wz/ylk/molpal-main/molpal/objectives/base.py", line 27, in call
return self.forward(*args, **kwargs)
File "/work/home/aixplorerbio_wz/ylk/molpal-main/molpal/objectives/docking.py", line 90, in forward
Y = self.c * self.virtual_screen(smis)
File "/work/home/aixplorerbio_wz/miniconda3/envs/molpal/lib/python3.8/site-packages/pyscreener/docking/screen.py", line 166, in call
completed_simulationsss = self.run(planned_simulationsss)
File "/work/home/aixplorerbio_wz/miniconda3/envs/molpal/lib/python3.8/site-packages/pyscreener/docking/screen.py", line 275, in run
return [
File "/work/home/aixplorerbio_wz/miniconda3/envs/molpal/lib/python3.8/site-packages/pyscreener/docking/screen.py", line 276, in
[ray.get(refs) for refs in refss]
File "/work/home/aixplorerbio_wz/miniconda3/envs/molpal/lib/python3.8/site-packages/pyscreener/docking/screen.py", line 276, in
[ray.get(refs) for refs in refss]
File "/work/home/aixplorerbio_wz/miniconda3/envs/molpal/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/work/home/aixplorerbio_wz/miniconda3/envs/molpal/lib/python3.8/site-packages/ray/worker.py", line 1733, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::prepare_and_run() (pid=3206, ip=10.3.3.62)
File "/work/home/aixplorerbio_wz/miniconda3/envs/molpal/lib/python3.8/site-packages/pyscreener/docking/vina/runner.py", line 67, in prepare_and_run
VinaRunner.prepare_ligand(data)
File "/work/home/aixplorerbio_wz/miniconda3/envs/molpal/lib/python3.8/site-packages/pyscreener/docking/vina/runner.py", line 75, in prepare_ligand
VinaRunner.prepare_from_smi(data)
File "/work/home/aixplorerbio_wz/miniconda3/envs/molpal/lib/python3.8/site-packages/pyscreener/docking/vina/runner.py", line 97, in prepare_from_smi
Chem.MMFFOptimizeMolecule(mol)
ValueError: Bad Conformer Id
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/work/home/aixplorerbio_wz/ylk/molpal-main/molpal/objectives/docking.py", line 101, in cleanup
writer.writerow(field.name for field in dataclasses.fields(results[0]))
IndexError: list index out of range
Docking: 1%|▏ | 585/40000 [23:56<26:53:23, 2.46s/ligand]
###################################################################################################
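A possible workaround (a sketch only, under the assumption that skipping such ligands is acceptable; `embeddable` is a name I made up, and RDKit must be importable) is to pre-filter SMILES whose conformer generation fails before they ever reach the docking objective:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def embeddable(smi: str) -> bool:
    """Return True only if RDKit can build and MMFF-optimize a 3D conformer."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return False
    mol = Chem.AddHs(mol)
    # EmbedMolecule returns -1 when no conformer could be generated,
    # which is the state that later raises "Bad Conformer Id" in MMFF
    if AllChem.EmbedMolecule(mol, randomSeed=42) < 0:
        return False
    # MMFFOptimizeMolecule returns -1 when the force field cannot be set up
    return AllChem.MMFFOptimizeMolecule(mol) >= 0

smis = ["CCO", "c1ccccc1", "not_a_smiles"]
valid = [s for s in smis if embeddable(s)]
```

This mirrors the preparation steps in the traceback, so anything passing the filter should also survive pyscreener's ligand preparation.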

Question for command

Hi!
I'm very interested in your program!

Is it possible to run the docking and lookup objective simultaneously?
command like:

python run.py --objective docking .... --objective lookup ....

Thank you for sharing.

[BUG]: unable to run molpal in docking objective

I am able to run molpal with the lookup objective but unable to run it with the docking objective.

I changed the objective configuration file as well but am facing a bug.

I am using Ubuntu 20.04.

Please find the attached screenshot:

(screenshot attached)

Please guide me.

Bug in explorer

Hello, I am trying to run my data.
While running, I ran into two errors.

The first error is below:

Exception raised! Intemediate state saved to "molpal_stock/chkpts/iter_25_2021-12-01_22-26-33/state.json"

Traceback (most recent call last):
  File "/home/njgoo/Data1/program/molpal/run.py", line 73, in <module>
    main()
  File "/home/njgoo/Data1/program/molpal/run.py", line 57, in main
    explorer.run()
  File "/home/njgoo/Data1/program/molpal/molpal/explorer.py", line 326, in run
    self.explore_batch()
  File "/home/njgoo/Data1/program/molpal/molpal/explorer.py", line 422, in explore_batch
    return sum(valid_scores)/len(valid_scores)
ZeroDivisionError: division by zero

I think valid_scores could be empty, so len(valid_scores) is zero.
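Right — a minimal guard sketch (illustrative helper names, not molpal's actual code) would return `None` for an empty batch and skip the `:0.3f` formatting when no average exists, which also covers the `NoneType.__format__` error in the second traceback below:

```python
def safe_average(valid_scores):
    """Average of the valid scores, or None when the batch produced none."""
    if not valid_scores:
        return None  # caller must handle the no-valid-scores case
    return sum(valid_scores) / len(valid_scores)

def format_top_k_avg(top_k_avg, k):
    # avoid "unsupported format string passed to NoneType.__format__"
    if top_k_avg is None:
        return f"FINAL TOP-{k} AVE: n/a"
    return f"FINAL TOP-{k} AVE: {top_k_avg:0.3f}"
```

The deeper question is why every score in the batch was invalid, but the guard at least keeps the run alive long enough to inspect the checkpoint.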

The second error is below:

Finished exploring!
Exception raised! Intemediate state saved to "molpal_stock/chkpts/iter_51_2021-12-02_10-16-51/state.json"
Traceback (most recent call last):
  File "run.py", line 73, in <module>
    main()
  File "run.py", line 57, in main
    explorer.run()
  File "/home/njgoo/Data1/program/molpal/molpal/explorer.py", line 329, in run
    print(f'FINAL TOP-{self.k} AVE: {self.top_k_avg:0.3f} | '
TypeError: unsupported format string passed to NoneType.__format__

I wonder whether my job is going through the wrong process.

What intermediate data should I check to debug this?

Thank you

test on in-house data, result is bad

Hi, thanks for your work.

I have used some of our in-house docking data to test molpal. Our data contains ~200k molecules and their corresponding affinities.

The results are rather bad: compared to the "real" docking data, only 168 molecules appear in the top 2k, i.e. a recovery rate of 8.4%.

Below is the test config file:

#############################################
path = results/greedy_top001_win10_delta01
window-size = 10
delta = 0.1
max-iters = 10
budget = 1.0
write-final = True
write-intermediate = True
retrain-from-scratch = True
ncpu = 1
fingerprint = pair
radius = 2
length = 2048
pool = eager
libraries = [/library/viva_196370_smiles.csv]
delimiter = ,
fps = library/viva_196370_smiles.h5
invalid-idxs = []
metric = greedy
init-size = 0.01
batch-sizes = [0.01]
objective = lookup
minimize = True
objective-config = /objective/viva20w_lookup.ini
model = rf
n-estimators = 100
max-depth = 8
min-samples-leaf = 1
precision = 32
top-k = 0.01
######################################################

Where did I go wrong? Any feedback would be appreciated!
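To make the comparison concrete, the recovery rate quoted above can be computed like this (a sketch: `top_k_recovery` and the toy data are mine, and scores are assumed to be lower-is-better docking scores):

```python
def top_k_recovery(pred_scores, true_scores, k):
    """Fraction of the true top-k (lowest scores) found in the predicted top-k."""
    top_pred = set(sorted(pred_scores, key=pred_scores.get)[:k])
    top_true = set(sorted(true_scores, key=true_scores.get)[:k])
    return len(top_pred & top_true) / k

pred = {"a": -9.0, "b": -8.0, "c": -1.0, "d": -2.0}
true = {"a": -9.5, "b": -1.0, "c": -8.5, "d": -2.0}
rate = top_k_recovery(pred, true, k=2)  # true top-2 is {a, c}, predicted {a, b}
```

With minimize = True set in the config, molpal's top-k bookkeeping follows the same lower-is-better convention as this sketch.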
