
Introduction


gnina (pronounced NEE-na) is a molecular docking program with integrated support for scoring and optimizing ligands using convolutional neural networks. It is a fork of smina, which is a fork of AutoDock Vina.

Help

Please subscribe to our Slack team. An example Colab notebook showing how to use gnina is available here. We also hosted a workshop on using gnina (video, slides).

Citation

If you find gnina useful, please cite our paper(s):

GNINA 1.0: Molecular docking with deep learning (Primary application citation)
A McNutt, P Francoeur, R Aggarwal, T Masuda, R Meli, M Ragoza, J Sunseri, DR Koes. J. Cheminformatics, 2021
link PubMed ChemRxiv

Protein–Ligand Scoring with Convolutional Neural Networks (Primary methods citation)
M Ragoza, J Hochuli, E Idrobo, J Sunseri, DR Koes. J. Chem. Inf. Model, 2017
link PubMed arXiv

Ligand pose optimization with atomic grid-based convolutional neural networks
M Ragoza, L Turner, DR Koes. Machine Learning for Molecules and Materials NIPS 2017 Workshop, 2017
arXiv

Visualizing convolutional neural network protein-ligand scoring
J Hochuli, A Helbling, T Skaist, M Ragoza, DR Koes. Journal of Molecular Graphics and Modelling, 2018
link PubMed arXiv

Convolutional neural network scoring and minimization in the D3R 2017 community challenge
J Sunseri, JE King, PG Francoeur, DR Koes. Journal of Computer-Aided Molecular Design, 2018
link PubMed

Three-Dimensional Convolutional Neural Networks and a Cross-Docked Data Set for Structure-Based Drug Design
PG Francoeur, T Masuda, J Sunseri, A Jia, RB Iovanisci, I Snyder, DR Koes. J. Chem. Inf. Model, 2020
link PubMed ChemRxiv

Virtual Screening with Gnina 1.0
J Sunseri, DR Koes. Molecules, 2021
link Preprints

Docker

A pre-built docker image is available here and Dockerfiles are here.
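
To verify the image works, something like the following should run the bundled gnina binary; the image name gnina/gnina and the mount layout are assumptions based on common Docker Hub conventions, so check them against the links above.

# Pull the pre-built image (assumed name).
docker pull gnina/gnina
# Mount the current directory into the container and run a basic docking job.
docker run --rm --gpus all -v "$(pwd)":/work -w /work gnina/gnina \
    gnina -r rec.pdb -l lig.sdf --autobox_ligand orig.sdf -o docked.sdf.gz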

Installation

We recommend that you use the pre-built binary unless you have significant experience building software on Linux, in which case building from source might result in an executable more optimized for your system.

Ubuntu 22.04

apt-get install build-essential git cmake wget libboost-all-dev libeigen3-dev libgoogle-glog-dev libprotobuf-dev protobuf-compiler libhdf5-dev libatlas-base-dev python3-dev librdkit-dev python3-numpy python3-pip python3-pytest libjsoncpp-dev

Follow NVIDIA's instructions to install the latest version of CUDA (>= 11.0 is required). Make sure nvcc is in your PATH.
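
For example, to make nvcc visible in the current shell (assuming the default /usr/local/cuda install location):

# Add the CUDA toolchain to PATH; adjust the path if CUDA was installed elsewhere.
export PATH=/usr/local/cuda/bin:$PATH
nvcc --version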

Optionally install cuDNN.

Install OpenBabel3. Note there are errors in bond order determination in version 3.1.1 and older.

git clone https://github.com/openbabel/openbabel.git
cd openbabel
mkdir build
cd build
cmake -DWITH_MAEPARSER=OFF -DWITH_COORDGEN=OFF -DPYTHON_BINDINGS=ON -DRUN_SWIG=ON ..
make
make install

Install gnina

git clone https://github.com/gnina/gnina.git
cd gnina
mkdir build
cd build
cmake ..
make
make install

Alternatively, the following complete build script for Ubuntu 22.04 installs CUDA 12.4 and cuDNN 9 and repeats all of the above steps from scratch:

sudo apt-get remove nvidia-cuda-toolkit
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
chmod 700 cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run
wget https://developer.download.nvidia.com/compute/cudnn/9.0.0/local_installers/cudnn-local-repo-ubuntu2204-9.0.0_1.0-1_amd64.deb
sudo dpkg -i cudnn-local-repo-ubuntu2204-9.0.0_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2204-9.0.0/cudnn-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cudnn-cuda-12
apt-get install build-essential git cmake wget libboost-all-dev libeigen3-dev libgoogle-glog-dev libprotobuf-dev protobuf-compiler libhdf5-dev libatlas-base-dev python3-dev librdkit-dev python3-numpy python3-pip python3-pytest libjsoncpp-dev

git clone https://github.com/openbabel/openbabel.git
cd openbabel
mkdir build
cd build
cmake -DWITH_MAEPARSER=OFF -DWITH_COORDGEN=OFF -DPYTHON_BINDINGS=ON -DRUN_SWIG=ON ..
make -j8
sudo make install

git clone https://github.com/gnina/gnina.git
cd gnina
mkdir build
cd build
cmake ..
make -j8
sudo make install

If you are building for systems with different GPUs (e.g. in a cluster environment), configure with -DCUDA_ARCH_NAME=All.
Note that the cmake build will automatically fetch and install libmolgrid if it is not already installed.
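
For example, a sketch of configuring such a portable build from the gnina build directory:

# Compile GPU code for all supported architectures (larger binary, runs on any GPU).
cmake -DCUDA_ARCH_NAME=All ..
make -j8
sudo make install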

The scripts provided in gnina/scripts have additional python dependencies that must be installed.

Usage

To dock ligand lig.sdf to a binding site on rec.pdb defined by another ligand orig.sdf:

gnina -r rec.pdb -l lig.sdf --autobox_ligand orig.sdf -o docked.sdf.gz

To perform docking with flexible side chain residues within 3.5 Angstroms of orig.sdf (generally not recommended unless prior knowledge indicates the pocket is highly flexible):

gnina -r rec.pdb -l lig.sdf --autobox_ligand orig.sdf --flexdist_ligand orig.sdf --flexdist 3.5 -o flex_docked.sdf.gz

To perform whole protein docking:

gnina -r rec.pdb -l lig.sdf --autobox_ligand rec.pdb -o whole_docked.sdf.gz --exhaustiveness 64

To utilize the default ensemble CNN in the energy minimization during the refinement step of docking (10 times slower than the default rescore option):

gnina -r rec.pdb -l lig.sdf --autobox_ligand orig.sdf --cnn_scoring refinement -o cnn_refined.sdf.gz

To utilize the default ensemble CNN for every step of docking (1000 times slower than the default rescore option):

gnina -r rec.pdb -l lig.sdf --autobox_ligand orig.sdf --cnn_scoring all -o cnn_all.sdf.gz

To use empirical scoring only (no CNN), with the Vinardo scoring function:

gnina -r rec.pdb -l lig.sdf --autobox_ligand orig.sdf --scoring vinardo --cnn_scoring none -o vinardo_docked.sdf.gz

To utilize a different CNN during docking (see help for possible options):

gnina -r rec.pdb -l lig.sdf --autobox_ligand orig.sdf --cnn dense -o dense_docked.sdf.gz

To minimize and score ligands ligs.sdf already positioned in a binding site:

gnina -r rec.pdb -l ligs.sdf --minimize -o minimized.sdf.gz

To covalently dock a pyrazole to a specific iron atom on the receptor, with the bond formed between a pyrazole nitrogen and the iron:

gnina  -r rec.pdb.gz -l conformer.sdf.gz --autobox_ligand bindingsite.sdf.gz --covalent_rec_atom A:601:FE --covalent_lig_atom_pattern '[$(n1nccc1)]' -o output.sdf.gz 

The same as above, but with the covalently bonding ligand atom manually positioned (instead of relying on OpenBabel's bonding heuristics) and the ligand/residue complex UFF-optimized:

gnina  -r rec.pdb.gz -l conformer.sdf.gz --autobox_ligand bindingsite.sdf.gz --covalent_lig_atom_position -11.796,31.887,72.682  --covalent_optimize_lig  --covalent_rec_atom A:601:FE --covalent_lig_atom_pattern '[$(n1nccc1)]' -o output.sdf.gz 

All options:

Input:
  -r [ --receptor ] arg              rigid part of the receptor
  --flex arg                         flexible side chains, if any (PDBQT)
  -l [ --ligand ] arg                ligand(s)
  --flexres arg                      flexible side chains specified by comma 
                                     separated list of chain:resid
  --flexdist_ligand arg              Ligand to use for flexdist
  --flexdist arg                     set all side chains within specified 
                                     distance to flexdist_ligand to flexible
  --flex_limit arg                   Hard limit for the number of flexible 
                                     residues
  --flex_max arg                     Retain at most the closest flex_max 
                                     flexible residues

Search space (required):
  --center_x arg                     X coordinate of the center
  --center_y arg                     Y coordinate of the center
  --center_z arg                     Z coordinate of the center
  --size_x arg                       size in the X dimension (Angstroms)
  --size_y arg                       size in the Y dimension (Angstroms)
  --size_z arg                       size in the Z dimension (Angstroms)
  --autobox_ligand arg               Ligand to use for autobox
  --autobox_add arg                  Amount of buffer space to add to 
                                     auto-generated box (default +4 on all six 
                                     sides)
  --autobox_extend arg (=1)          Expand the autobox if needed to ensure the
                                     input conformation of the ligand being 
                                     docked can freely rotate within the box.
  --no_lig                           no ligand; for sampling/minimizing 
                                     flexible residues

Covalent docking:
  --covalent_rec_atom arg            Receptor atom ligand is covalently bound 
                                     to.  Can be specified as 
                                     chain:resnum:atom_name or as x,y,z 
                                     Cartesian coordinates.
  --covalent_lig_atom_pattern arg    SMARTS expression for ligand atom that 
                                     will covalently bind protein.
  --covalent_lig_atom_position arg   Optional.  Initial placement of covalently
                                     bonding ligand atom in x,y,z Cartesian 
                                     coordinates.  If not specified, 
                                     OpenBabel's GetNewBondVector function will
                                     be used to position ligand.
  --covalent_fix_lig_atom_position   If covalent_lig_atom_position is 
                                     specified, fix the ligand atom to this 
                                     position as opposed to using this position
                                     to define the initial structure.
  --covalent_bond_order arg (=1)     Bond order of covalent bond. Default 1.
  --covalent_optimize_lig            Optimize the covalent complex of ligand 
                                     and residue using UFF. This will change 
                                     bond angles and lengths of the ligand.

Scoring and minimization options:
  --scoring arg                      specify alternative built-in scoring 
                                     function: ad4_scoring default dkoes_fast 
                                     dkoes_scoring dkoes_scoring_old vina 
                                     vinardo
  --custom_scoring arg               custom scoring function file
  --custom_atoms arg                 custom atom type parameters file
  --score_only                       score provided ligand pose
  --local_only                       local search only using autobox (you 
                                     probably want to use --minimize)
  --minimize                         energy minimization
  --randomize_only                   generate random poses, attempting to avoid
                                     clashes
  --num_mc_steps arg                 fixed number of monte carlo steps to take 
                                     in each chain
  --max_mc_steps arg                 cap on number of monte carlo steps to take
                                     in each chain
  --num_mc_saved arg                 number of top poses saved in each monte 
                                     carlo chain
  --temperature arg                  temperature for metropolis accept 
                                     criterion
  --minimize_iters arg (=0)          number of iterations of steepest descent; 
                                     default scales with rotors and usually 
                                     isn't sufficient for convergence
  --accurate_line                    use accurate line search
  --simple_ascent                    use simple gradient ascent
  --minimize_early_term              Stop minimization before convergence 
                                     conditions are fully met.
  --minimize_single_full             During docking perform a single full 
                                     minimization instead of a truncated 
                                     pre-evaluate followed by a full.
  --approximation arg                approximation (linear, spline, or exact) 
                                     to use
  --factor arg                       approximation factor: higher results in a 
                                     finer-grained approximation
  --force_cap arg                    max allowed force; lower values more 
                                     gently minimize clashing structures
  --user_grid arg                    Autodock map file for user grid data based
                                     calculations
  --user_grid_lambda arg (=-1)       Scales user_grid and functional scoring
  --print_terms                      Print all available terms with default 
                                     parameterizations
  --print_atom_types                 Print all available atom types

Convolutional neural net (CNN) scoring:
  --cnn_scoring arg (=1)             Amount of CNN scoring: none, rescore 
                                     (default), refinement, metrorescore 
                                     (metropolis+rescore), metrorefine 
                                     (metropolis+refine), all
  --cnn arg                          built-in model to use, specify 
                                     PREFIX_ensemble to evaluate an ensemble of
                                     models starting with PREFIX: 
                                     crossdock_default2018 
                                     crossdock_default2018_1 
                                     crossdock_default2018_2 
                                     crossdock_default2018_3 
                                     crossdock_default2018_4 default2017 dense 
                                     dense_1 dense_2 dense_3 dense_4 
                                     general_default2018 general_default2018_1 
                                     general_default2018_2 
                                     general_default2018_3 
                                     general_default2018_4 redock_default2018 
                                     redock_default2018_1 redock_default2018_2 
                                     redock_default2018_3 redock_default2018_4
  --cnn_model arg                    caffe cnn model file; if not specified a 
                                     default model will be used
  --cnn_weights arg                  caffe cnn weights file (*.caffemodel); if 
                                     not specified default weights (trained on 
                                     the default model) will be used
  --cnn_resolution arg (=0.5)        resolution of grids, don't change unless 
                                     you really know what you are doing
  --cnn_rotation arg (=0)            evaluate multiple rotations of pose (max 
                                     24)
  --cnn_update_min_frame arg (=1)    During minimization, recenter coordinate 
                                     frame as ligand moves
  --cnn_freeze_receptor              Don't move the receptor with respect to a 
                                     fixed coordinate system
  --cnn_mix_emp_force                Merge CNN and empirical minus forces
  --cnn_mix_emp_energy               Merge CNN and empirical energy
  --cnn_empirical_weight arg (=1)    Weight for scaling and merging empirical 
                                     force and energy 
  --cnn_outputdx                     Dump .dx files of atom grid gradient.
  --cnn_outputxyz                    Dump .xyz files of atom gradient.
  --cnn_xyzprefix arg (=gradient)    Prefix for atom gradient .xyz files
  --cnn_center_x arg                 X coordinate of the CNN center
  --cnn_center_y arg                 Y coordinate of the CNN center
  --cnn_center_z arg                 Z coordinate of the CNN center
  --cnn_verbose                      Enable verbose output for CNN debugging

Output:
  -o [ --out ] arg                   output file name, format taken from file 
                                     extension
  --out_flex arg                     output file for flexible receptor residues
  --log arg                          optionally, write log file
  --atom_terms arg                   optionally write per-atom interaction term
                                     values
  --atom_term_data                   embedded per-atom interaction terms in 
                                     output sd data
  --pose_sort_order arg (=0)         How to sort docking results: CNNscore 
                                     (default), CNNaffinity, Energy
  --full_flex_output                 Output entire structure for out_flex, not 
                                     just flexible residues.

Misc (optional):
  --cpu arg                          the number of CPUs to use (the default is 
                                     to try to detect the number of CPUs or, 
                                     failing that, use 1)
  --seed arg                         explicit random seed
  --exhaustiveness arg (=8)          exhaustiveness of the global search 
                                     (roughly proportional to time)
  --num_modes arg (=9)               maximum number of binding modes to 
                                     generate
  --min_rmsd_filter arg (=1)         rmsd value used to filter final poses to 
                                     remove redundancy
  -q [ --quiet ]                     Suppress output messages
  --addH arg                         automatically add hydrogens in ligands (on
                                     by default)
  --stripH arg                       remove hydrogens from molecule _after_ 
                                     performing atom typing for efficiency (off
                                     by default)
  --device arg (=0)                  GPU device to use
  --no_gpu                           Disable GPU acceleration, even if 
                                     available.

Configuration file (optional):
  --config arg                       the above options can be put here

Information (optional):
  --help                             display usage summary
  --help_hidden                      display usage summary with hidden options
  --version                          display program version
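
Frequently used options can be collected in a configuration file and passed with --config. A minimal sketch, assuming the smina/AutoDock Vina-style key = value format with option names written without the leading dashes; verify against the --help output of your build:

# dock.conf -- hypothetical configuration file
receptor = rec.pdb
ligand = lig.sdf
autobox_ligand = orig.sdf
exhaustiveness = 16
out = docked.sdf.gz

Then run:

gnina --config dock.conf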


CNN Scoring

--cnn_scoring determines at which points of the docking procedure the CNN scoring function is used.

  • none - No CNNs used for docking. Uses the specified empirical scoring function throughout.
  • rescore (default) - CNN used for reranking of final poses. Least computationally expensive CNN option.
  • refinement - CNN used to refine poses after Monte Carlo chains and for final ranking of output poses. 10x slower than rescore when using a GPU.
  • all - CNN used as the scoring function throughout the whole procedure. Extremely computationally intensive and not recommended.

The default CNN scoring function is an ensemble of 5 models selected to balance pose prediction performance and runtime: dense, general_default2018_3, dense_3, crossdock_default2018, and redock_default2018. More information on these various models can be found in the papers listed above.
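
Per the --cnn help text above, appending _ensemble to a prefix evaluates every built-in model whose name starts with that prefix. For example, to rescore with the ensemble of dense models instead of the default ensemble:

gnina -r rec.pdb -l lig.sdf --autobox_ligand orig.sdf --cnn dense_ensemble -o docked.sdf.gz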

Training

Scripts to aid in training new CNN models can be found at https://github.com/gnina/scripts and sample models at https://github.com/gnina/models.

The DUD-E docked poses used in the original paper can be found here and the CrossDocked2020 set is here.

License

gnina is dual licensed under GPL and Apache. The GPL license is necessitated by the use of OpenBabel (which is GPL licensed). In order to use gnina under the Apache license only, all references to OpenBabel must be removed from the source code.


Issues (gnina/scripts)

error when running train.py

Hi,
I tried to run train.py with my data set and the following error occurred:

F0830 00:47:20.333189 25394 molgrid_data_layer.cpp:600] Check failed: pose < ex.sets.size()-1 (0 vs. 0) Incorrect pose index

I used the following command line:

python train.py -m hires_pose.model -p total1_ -d types -o total1_hires_pos

The dataset consists of total1_train0.types and total1_test0.types, and the gninatypes files are in the types folder.

What could be the reason for this kind of error?

system has unsupported display driver / cuda driver combination

python3 train.py -p first_ -d /root/data --dynamic

  0           test first_test0.types
  0          train first_train0.types
  1           test first_test1.types
  1          train first_train1.types
  2           test first_test2.types
  2          train first_train2.types
WRITING solver.36822.prototxt

Traceback (most recent call last):
  File "/root/data/gnina_train/train.py", line 932, in <module>
    results = train_and_test_model(args, train_test_files[i], outname, cont)
  File "/root/data/gnina_train/train.py", line 441, in train_and_test_model
    solver = caffe.get_solver(solverf)
RuntimeError: system has unsupported display driver / cuda driver combination

SYSTEM INFORMATION

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

ls /usr/local
bin cuda cuda-11 cuda-11.7 etc games include lib man python sbin share src

How to create cache/types files for multiple receptors and ligands for the PDBbind dataset?

How do I create a cache file for multiple receptors and ligands (they are in gninatypes format)?

  1. I know I can use gninatyper to convert ligands and receptors into gninatypes files, but I do not know exactly how to use create_caches2.py to generate a cache file for a ligand/receptor pair (each pair is in its own directory named by PDB id, together with docked poses and crystal ligands). (BTW, is it necessary to convert them into a cache rather than a types file, considering the time consumption?)

  2. I also wonder whether it is possible to generate one cache file (or types file) for the whole dataset, just like the ccv_*.types files, i.e. a single train/test file containing thousands of complexes. Does this require several cross-validation script pipelines?

Specifically, I have no idea how to add the rmsd and affinity values to the types or cache file; do they call for csv files?

Here is the create_caches2.py description, which does not indicate what -fname is or how to add rmsd and affinity data to each line:
'''Takes a bunch of types training files. First argument is what index the receptor starts on (ligands are assumed to be right after). Reads in the gninatypes files specified in these types files and writes out two monolithic receptor and ligand cache files in version 2 format.
Version 2 is optimized for memory mapped storage of caches. keys (file names) are stored first followed by dense storage of values (coordinates and types).
'''
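
For reference, a types file is plain text with one example per line: the numeric labels first, then the receptor and ligand gninatypes paths. A sketch of a line carrying pose label, affinity, and rmsd; this column order is an assumption based on the example lines quoted elsewhere on this page:

# label affinity rmsd receptor ligand
1 5.47 0.82 1abc/1abc_rec.gninatypes 1abc/1abc_docked_0.gninatypes

With that layout the receptor path sits at column index 3, which per the docstring above would be the script's first argument (exact flags may differ between versions):

python3 create_caches2.py 3 train0.types test0.types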

Is there a model template of redock_default2018 model?

Hello, thanks for your contributions to the SBDD community; you have been working on AI + SBDD projects for almost ten years, and I have read the papers describing these wonderful efforts.

Recently, I wanted to reproduce the default models you made, but I could not find any related config or script to reproduce them. I want to understand the details behind those models; hopefully you can provide more information.

Below is the list of models I want to reproduce.
crossdock_default2018
default2017
dense
general_default2018
redock_default2018

Thanks again.

Details in train.py output

Dear professors and developers, I have a few questions about the training procedure.

  1. Should all six folds contain roughly the same number of samples, or should every train:test split be 8:2?

  2. After training, one of the default2018*.out files contains: 0.801041 0.795815 0.558788 0.010000 1.746990 1.742589. The first two metrics are max_AUC and min_AUC; what are the remaining four numbers?

How to create protein XXX_nowat.pdb and correct gninatypes files

I am trying to re-create the pre-processed data and am encountering two problems when creating the protein gninatypes files.

  1. Removing water and hydrogen from XXX_protein.pdb to generate XXX_nowat.pdb. I retain only the lines beginning with "ATOM", except for hydrogens, using my own code. However, my results sometimes differ from the XXX_nowat.pdb in the downloaded datasets. I notice that some XXX_nowat.pdb files in the datasets drop chain "B" relative to the original PDB file when the protein has more than two chains, but not always. Is there a script for creating XXX_nowat.pdb that you could share with me?

  2. Creating XXX_rec_0.gninatypes. Even when I use the ready-made XXX_nowat.pdb file to create gninatypes files with gninatyper, the files my gninatyper generates differ from the existing XXX_rec_0.gninatypes. Comparing the two, the atom coordinates match, but some atom types differ; e.g., a "NitrogenXSDonor" in the ready-made file is labeled "NitrogenXSAcceptor" in my result, and an "AromaticCarbonXSHydrophobe" is labeled "AliphaticCarbonXSHydrophobe". Could this be caused by my gnina version, "master:e60ccc0+", being newer than the ready-made gninatypes files?
Thanks a lot!

how to solve: Check failed: end < layers_.size() (-1 vs. 0)

Dear all,
I am a beginner and trying the first time to train a model using the train.py and the default2018.model. I am receiving this error message:

I1204 12:22:43.597291 9247 solver.cpp:352] Iteration 0, Testing net (#0)
I1204 12:22:43.597334 9247 solver.cpp:352] Iteration 0, Testing net (#1)
F1204 12:22:43.597359 9247 net.cpp:523] Check failed: end < layers_.size() (-1 vs. 0)
*** Check failure stack trace: ***

How can I solve this?
Thanks a lot!

default iterations mismatch

The README says train.py defaults to iterations=10000, but in the script itself the default is 250,000:
parser.add_argument('-i','--iterations',type=int,required=False,help="Number of iterations to run,default 250,000",default=250000)

This could lead to mistakes if people don't notice.
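
Until the README and the script agree, it is safest to pass the iteration count explicitly; a sketch reusing the flags shown in other issues on this page:

python3 train.py -m default2018.model -p total1_ -d types -i 10000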

Any examples for generating a gninatypes file?

I have completed the libmolgrid/gnina setup to calculate the binding affinity of my data.

I have successfully tested train.py, which creates a new model from existing data, and predict.py, which makes predictions (CrossDocked2020 dataset and default2018 model).

However, I cannot make predictions with my own data.

My data:
protein.pdb
protein_rec.pdb
molecule.sdf

protein.pdb is the full protein structure downloaded from the PDB, protein_rec.pdb contains only the binding site, and the 3D coordinates of the molecule are in the sdf file.

Are there any examples or explanations of how to convert these data to .gninatypes?
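
The gninatyper tool mentioned in other issues on this page performs this conversion. A sketch of a plausible invocation; the argument order here is an assumption, so check gninatyper's own usage message:

# assumed usage: gninatyper <input structure> <output file>
gninatyper protein_rec.pdb protein_rec.gninatypes
gninatyper molecule.sdf molecule.gninatypes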

eval model

If I want to evaluate a trained caffemodel, should I use train.py --test_only?

questions about generate_unique_lig_poses.py

generate_unique_lig_poses.py - Script for counter-example generation which computes all of the unique ligand poses in a directory

Does "counter-example" here mean a decoy ligand pose from cross-docking?
Can I use this script to generate ligand poses the way MC sampling does?

No caffe

Traceback (most recent call last):
  File "/home/gnina_train/train.py", line 11, in <module>
    import caffe

According to the README, when I built gnina from source, Caffe should already have been built for my Python 3. But now it is missing. What should I do?
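
As another issue on this page notes, the Caffe built inside the gnina tree lives under gnina/caffe/python; putting that directory on PYTHONPATH is a plausible fix (the exact path below is an assumption for your checkout):

# Point Python at the Caffe bundled with the gnina source tree.
export PYTHONPATH=/path/to/gnina/caffe/python:$PYTHONPATH
python3 -c "import caffe"   # should now succeed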

How does percent_reduced reduce the dataset?

When I looked into the train.py script, I found this parameter a little mysterious.

use_reduced = bool(args.reduced or args.percent_reduced)

According to this code, it should be a parameter that reduces the samples of the training and test sets (or not?).

Early stopping criteria

According to your paper, "Early stopping evaluates the loss of the trained network every 1000 iterations on a sample of the training set. The size of this sample is determined by the percent_reduced parameter to [train.py]." So percent_reduced relates to the training set, but I am not sure whether it also affects the test set. If I set percent_reduced to 5, what will the early-stopping sample size be?

ChatGPT ANSWER

According to ChatGPT, the percent_reduced is something about the learning_rate: "In gnina, the percent_reduced parameter does not directly reduce the dataset. My earlier answer was inaccurate. I apologize for the confusion.
The percent_reduced parameter in gnina's train.py script is used to specify the percentage of reduced learning rate schedule to be applied during training. It controls the rate at which the learning rate is reduced over time.
The reduction of dataset samples can be achieved through other means, such as data preprocessing steps or by providing a subset of the original dataset to the train.py script using other command-line arguments or data loading techniques. "
So, is its answer totally wrong?

No valid stratified examples error

I have a question related to balancing and stratify_receptor.

When I train a model with default2018, there is no problem,
but when I try to train a model with dense, I get the following error:

File "train.py", line 935, in
results = train_and_test_model(args, train_test_files[i], outname, cont)
File "train.py", line 502, in train_and_test_model
solver.step(test_interval)
ValueError: No valid stratified examples.

The balancing and stratify_receptor settings are the same between dense.model and default2018.model,

and my training data contain both positive and negative labeled samples.

Any possible clue to this error?

Thanks

How to create ligand UFF-optimized and minimized poses?

I see there are XXX_ligand.sdf, XXX_min.sdf, and XXX_uff.sdf files for most ligands in the PDBbind2016 dataset (though a few ligands lack _min.sdf and _uff.sdf). I also notice "these crystal ligands were optimized independently of the receptor using the UFF forcefield of RDKit to replicate the conformer generation process and then minimized with respect to the receptor using smina." in the paper "Visualizing Convolutional Neural Network Protein-Ligand Scoring". So I guess uff.sdf is created with the UFF forcefield of RDKit and then used for minimization. But I am eager to know how to create uff.sdf with RDKit when adding new crystal data. I tried code like
AllChem.EmbedMolecule(mol)
AllChem.UFFOptimizeMolecule(mol), or UFFOptimizeMoleculeConfs()
but the results I get are quite different from the PDBbind2016 dataset. Could you describe the steps to create the UFF-optimized and minimized poses, please? Thanks a lot!

I have a question about how to train using train.py.

I want to test learning with only some of my data.

It consists of 1 pdb file and several sdf files.

The types file is written as follows.

ex) 0 5.058014 data/1V4S_rec_test_0.gninatypes data/conf_5899.sdf # 0

Since the rmsd would be an arbitrary value, I deleted it and also changed the has_rmsd option to false in the model.

But now I am getting the following error:

Traceback (most recent call last):
  File "train.py", line 932, in <module>
    results = train_and_test_model(args, train_test_files[i], outname, cont)
  File "train.py", line 499, in train_and_test_model
    solver.step(test_interval)
ValueError: No valid stratified examples.

My guess is that it's a problem with the stratify_receptor option.

If I set that option to false, the error changes to:

ValueError: No valid examples found in training set.

I thought it might be a problem with recognition of the pdb file, so I tried converting it to gninatypes using gninatyper, but the same error occurs.

Any solution?

Dependencies: Caffe

I installed all the dependencies following the README.md, except Caffe. When I tried to run train.py, I found it also depends on Caffe. I then tried to install Caffe in the environment, but it conflicts with other packages. Could anyone suggest a solution? Thanks a lot! :)

Bad performance when there is no RMSD information

Here, I contrast two experiments: normal training vs. training without RMSD. I thought that as long as the pose label and affinity label are given, training wouldn't differ much. However, the RMSD-free training resulted in bizarre performance.

I used the same arguments and gninatypes files to train the model from crossdock_default2018.caffemodel using a modified default2018.model.
The rmsd columns in the RMSD-free types files are removed, so the lines look like:

0 3.906 pdb2019_refi_train_gninatypes/4u6w/4u6w_rec.gninatypes redock_default2018_pdbbind_v2019_docked_gninatypes/4u6w_docked_7.gninatypes
1 5.47 pdb2019_refi_train_gninatypes/1gi1/1gi1_rec.gninatypes pdb2019_refi_train_gninatypes/1gi1/1gi1_ligand.gninatypes

This is the model data layer; I commented out the top: "rmsd_true" line. In the TEST phase I set has_rmsd false; in the TRAIN phase I set balanced true, stratify_receptor false, and has_rmsd false:

layer {
  name: "data"
  type: "MolGridData"
  top: "data"
  top: "label"
  top: "affinity"
  # top: "rmsd_true"
  include {
    phase: TEST
  }
  molgrid_data_param {
        source: "TESTFILE"
        batch_size: 50
        dimension: 23.5
        resolution: 0.500000
        shuffle: false
        ligmap: "completelig"
        recmap: "completerec"
        balanced: false
        has_affinity: true
        has_rmsd: false
        root_folder: "DATA_ROOT"
    }
  }
  
layer {
  name: "data"
  type: "MolGridData"
  top: "data"
  top: "label"
  top: "affinity"
  # top: "rmsd_true"
  include {
    phase: TRAIN
  }
  molgrid_data_param {
        source: "TRAINFILE"
        batch_size:  50
        dimension: 23.5
        resolution: 0.500000
        shuffle: true
        balanced: true
        jitter: 0.000000
        ligmap: "completelig"
        recmap: "completerec"        
        stratify_receptor: false
        stratify_affinity_min: 0
        stratify_affinity_max: 0
        stratify_affinity_step: 1.000000
        has_affinity: true
        has_rmsd: false
        random_rotation: true
        random_translate: 6
        root_folder: "DATA_ROOT"       
    }
}

The rmsd layer is also deleted:

layer {
  name: "rmsd"
  type: "AffinityLoss"
  bottom: "affinity_output"
  bottom: "affinity"
  top: "rmsd"
...

RuntimeError: invalid device function in train.py

I built a new image with the most recently uploaded Dockerfile (ubuntu18.04 / cuda11.4 / cudnn8).

After adding gnina/caffe/python to the PYTHONPATH, I confirmed that the Caffe installation was successful.

I deleted the ligcache/reccache part in order to use default2017 from the models/crossdocked_paper models.

command:
python3 train.py -m ../models/crossdocked_paper/default2017_modify.model -p ../models/data/PDBBind2016/General_types/gen_uff_ -d ../data/

However, an error occurs when training on existing data using train.py.

error :
I0622 06:02:44.843400 8979 solver.cpp:57] Solver scaffolding done.
I0622 06:02:44.858086 8979 solver.cpp:352] Iteration 0, Testing net (#0)
I0622 06:02:44.858325 8979 solver.cpp:352] Iteration 0, Testing net (#1)
Traceback (most recent call last):
  File "train.py", line 932, in <module>
    results = train_and_test_model(args, train_test_files[i], outname, cont)
  File "train.py", line 499, in train_and_test_model
    solver.step(test_interval)
RuntimeError: invalid device function

Can you tell me the cause?
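
"invalid device function" usually means the GPU kernels were compiled for a different compute capability than the GPU in use. One plausible remedy, assuming a from-source build, is to reconfigure with all architectures enabled, as the installation section above suggests for heterogeneous systems:

# Rebuild gnina with GPU code for all supported architectures.
cd gnina/build
cmake -DCUDA_ARCH_NAME=All ..
make -j8
sudo make install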
