
equibind's Introduction

EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction

Before using EquiBind, consider checking out our newer approach, DiffDock, which improves over EquiBind in multiple ways. See the DiffDock GitHub repository and paper.

EquiBind is an SE(3)-equivariant geometric deep learning model that performs direct-shot prediction of both i) the receptor binding location (blind docking) and ii) the ligand’s bound pose and orientation. EquiBind achieves significant speed-ups compared to traditional and recent baselines. If you have questions, don't hesitate to open an issue or ask me via [email protected] or social media or Octavian Ganea via [email protected]. We are happy to hear from you!

Dataset

Our preprocessed data (see dataset section in the paper Appendix) is available from zenodo.
The files in data contain the names for the time-based data split.

If you want to train one of our models with the data then:

  1. download it from zenodo
  2. unzip the directory and place it into data such that you have the path data/PDBBind

Use provided model weights to predict binding structure of your own protein-ligand pairs:

Step 1: What you need as input

Ligand files in .mol2, .sdf, .pdbqt, or .pdb format whose names contain the string ligand (your ligand files should contain all hydrogens).
Receptor files in .pdb format whose names contain the string protein. We ran reduce on our training proteins, so you may want to run it on your protein as well.
For each complex you want to predict, you need a directory containing the ligand and receptor file, like this:

my_data_folder
└───name1
    │   name1_protein.pdb
    │   name1_ligand.sdf
└───name2
    │   name2_protein.pdb
    │   name2_ligand.mol2
...
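
The layout above can be checked before running inference. The following is a minimal sketch, not part of EquiBind: the function names `check_complex_dir` and `check_data_folder` are hypothetical, and the checks only mirror the naming rules stated in Step 1 (a file whose name contains ligand with one of the listed extensions, and a .pdb file whose name contains protein).

```python
from pathlib import Path
from typing import List

# Ligand extensions listed in Step 1.
LIGAND_SUFFIXES = {".mol2", ".sdf", ".pdbqt", ".pdb"}


def check_complex_dir(complex_dir: Path) -> List[str]:
    """Return a list of problems found in one complex directory (empty if none)."""
    files = [p for p in complex_dir.iterdir() if p.is_file()]
    ligands = [p for p in files if "ligand" in p.name and p.suffix in LIGAND_SUFFIXES]
    receptors = [p for p in files if "protein" in p.name and p.suffix == ".pdb"]
    problems = []
    if not ligands:
        problems.append(f"{complex_dir.name}: no ligand file found")
    if not receptors:
        problems.append(f"{complex_dir.name}: no receptor file found")
    return problems


def check_data_folder(root: str) -> List[str]:
    """Check every subdirectory of `root` and collect all problems."""
    problems = []
    for sub in sorted(Path(root).iterdir()):
        if sub.is_dir():
            problems.extend(check_complex_dir(sub))
    return problems
```

Calling `check_data_folder("my_data_folder")` returns an empty list when every complex directory contains both files.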

Step 2: Setup Environment

We will set up the environment using Anaconda. Clone the current repo:

git clone https://github.com/HannesStark/EquiBind

Create a new environment with all required packages using environment.yml. If you have a CUDA GPU run:

conda env create -f environment.yml

If you instead only have a CPU run:

conda env create -f environment_cpuonly.yml

Activate the environment:

conda activate equibind

Here are the requirements themselves for the case with a CUDA GPU if you want to install them manually instead of using the environment.yml:

python=3.7
pytorch 1.10
torchvision
cudatoolkit=10.2
torchaudio
dgl-cuda10.2
rdkit
openbabel
biopython
biopandas
pot
dgllife
joblib
pyaml
icecream
matplotlib
tensorboard

Step 3: Predict Binding Structures!

In the config file configs_clean/inference.yml set the path to your input data folder inference_path: path_to/my_data_folder.
Then run:

python inference.py --config=configs_clean/inference.yml

Done! 🎉
Your results are saved as .sdf files in the directory specified in the config file under output_directory: 'data/results/output' and as tensors at runs/flexible_self_docking/predictions_RDKitFalse.pt!

Inference for multiple ligands in the same .sdf file and a single receptor

python multiligand_inference.py -o path/to/output_directory -r path/to/receptor.pdb -l path/to/ligands.sdf

This runs EquiBind on every ligand in ligands.sdf against the protein in receptor.pdb. The outputs are 3 files in output_directory with the following names and contents:

failed.txt - contains the index (in the file ligands.sdf) and name of every molecule for which inference failed in a way that was caught and handled.
success.txt - contains the index (in the file ligands.sdf) and name of every molecule for which inference succeeded.
output.sdf - contains the conformers produced by EquiBind in .sdf format.

Reproducing paper numbers

Download the data and place it as described in the "Dataset" section above.

Using the provided model weights

To predict binding structures using the provided model weights run:

python inference.py --config=configs_clean/inference_file_for_reproduce.yml

This will give you the results of EquiBind-U and then those of EquiBind after running the fast ligand point cloud fitting corrections.
The numbers are a bit better than what is reported in the paper. We will put the improved numbers into the next update of the paper.

Training a model yourself and using those weights

To train the model yourself, run:

python train.py --config=configs_clean/RDKitCoords_flexible_self_docking.yml

The model weights are saved in the runs directory.
You can also start a tensorboard server tensorboard --logdir=runs and watch the model train.
To evaluate the model on the test set, change the run_dirs: entry of the config file inference_file_for_reproduce.yml to point to the directory produced in runs. Then you can run python inference.py --config=configs_clean/inference_file_for_reproduce.yml as above!

Reference

📃 Paper on arXiv

@inproceedings{equibind,
  title={Equibind: Geometric deep learning for drug binding structure prediction},
  author={St{\"a}rk, Hannes and Ganea, Octavian and Pattanaik, Lagnajit and Barzilay, Regina and Jaakkola, Tommi},
  booktitle={International Conference on Machine Learning},
  pages={20503--20521},
  year={2022},
  organization={PMLR}
}

equibind's People

Contributors

amfaber, hannesstark

equibind's Issues

Inference should be able to be run from any directory?

Running inference.py from a different project directory gives this error. Would it be possible to add code to allow it to be run from anywhere?

Traceback (most recent call last):
  File "/home/jadolfbr/EquiBind/inference.py", line 460, in <module>
    with open(os.path.join(os.path.dirname(args.checkpoint), 'train_arguments.yaml'), 'r') as arg_file:
FileNotFoundError: [Errno 2] No such file or directory: 'runs/flexible_self_docking/train_arguments.yaml'
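
The relative path in the traceback is looked up from the current working directory, so one workaround is simply to change into the repository before running the script. Another option is to anchor relative paths at a fixed base directory. The sketch below is an assumption about how that could look, not code from EquiBind; `resolve_against` is a hypothetical helper.

```python
import os


def resolve_against(base_dir: str, path: str) -> str:
    """Return `path` unchanged if it is absolute, otherwise join it onto `base_dir`."""
    if os.path.isabs(path):
        return path
    return os.path.normpath(os.path.join(base_dir, path))


# Hypothetical usage inside a script: anchor relative paths at the folder that
# contains the script instead of the current working directory.
# repo_dir = os.path.dirname(os.path.abspath(__file__))
# arg_path = resolve_against(repo_dir, "runs/flexible_self_docking/train_arguments.yaml")
```

Absolute paths pass through untouched, so existing configs that already use them would behave the same.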

Inference on my own PDB throws receptor error.

Processing BRD4: complex 1 of 1
Trying to load data/predict/BRD4/BRD4_ligand.sdf
Docking the receptor data/predict/BRD4/BRD4_protein.pdb
To the ligand data/predict/BRD4/BRD4_ligand.sdf
Traceback (most recent call last):
File "inference.py", line 471, in <module>
inference_from_files(args)
File "inference.py", line 339, in inference_from_files
rec, rec_coords, c_alpha_coords, n_coords, c_coords = get_receptor(rec_path, lig, cutoff=dp['chain_radius'])
File "/home/joshua-talo/GitHub/Python/EquiBind/commons/process_mols.py", line 373, in get_receptor
c_alpha_coords = np.concatenate(valid_c_alpha_coords, axis=0) # [n_residues, 3]
File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate

The SDF file of the drug I wish to dock was generated using Open Babel, while the PDB was obtained from https://www.rcsb.org/structure/3mxf, where I removed the docked JQ1 molecule as well as other heteroatoms used for crystallography. For some reason this error is only thrown when I supply an SDF generated by other tools; when I use the SDF file from the PDB link itself, which is positioned at the ligand binding site, it works and generates the predicted pose.

Looking into the source code, I think the get_receptor function searches for receptor atoms and their coordinates within a certain proximity of the ligand SDF coordinates. If this is the case, I think it is unable to find the receptor’s alpha carbon atoms and thereby has nothing to concatenate. Am I right regarding this?
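
To make the hypothesis above concrete, here is a small sketch of proximity-based filtering. It is not EquiBind's get_receptor, and whether get_receptor behaves this way is not confirmed here; the names `residues_near_ligand` and `select_or_fail` are hypothetical. It only shows how a ligand placed far from the protein would leave nothing to concatenate.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def distance(a: Point, b: Point) -> float:
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def residues_near_ligand(ca_coords: Sequence[Point],
                         lig_coords: Sequence[Point],
                         cutoff: float) -> List[Point]:
    """Keep C-alpha positions that lie within `cutoff` of any ligand atom."""
    return [ca for ca in ca_coords
            if any(distance(ca, atom) <= cutoff for atom in lig_coords)]


def select_or_fail(ca_coords, lig_coords, cutoff):
    """Like residues_near_ligand, but fail with a readable message when empty."""
    kept = residues_near_ligand(ca_coords, lig_coords, cutoff)
    if not kept:
        raise ValueError("no receptor residues within the cutoff of the ligand")
    return kept
```

With a ligand whose coordinates are far from every residue, the filtered list is empty, which is the situation the hypothesis describes.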

Also, may I know how I could go about doing full blind docking with SDF files of ligands generated from other tools to find novel binding sites on a protein?

Thank you.

Torch not compiled with CUDA enabled

Hello, when I run multiligand_inference.py, I get this error:

python multiligand_inference.py -o ./my_data_folder/result/ -r ./my_data_folder/multiligand-test/5v4q_protein.pdb -l ./my_data_folder/multiligand-test/ligand.sdf

Namespace(batch_size=8, checkpoint=None, config=None, device='cpu', lazy_dataload=None, lig_slice=None, ligands_sdf='./my_data_folder/multiligand-test/ligand.sdf', n_workers_data_load=0, num_confs=1, output_directory='./my_data_folder/result/', rec_pdb='./my_data_folder/multiligand-test/5v4q_protein.pdb', run_corrections=True, seed=1, skip_in_output=True, train_args=None, use_rdkit_coords=False)
[2022-07-08 10:34:33.719185] [ Using Seed : 1 ]
Found 0 previously calculated ligands
device = cpu
Entering batch ending in index 5/5
Traceback (most recent call last):
File "multiligand_inference.py", line 278, in <module>
main()
File "multiligand_inference.py", line 275, in main
write_while_inferring(lig_loader, model, args)
File "multiligand_inference.py", line 217, in write_while_inferring
lig_graphs = lig_graphs.to(args.device)
File "/data/anaconda/envs/equibind/lib/python3.7/site-packages/dgl/heterograph.py", line 5448, in to
ret._graph = self._graph.copy_to(utils.to_dgl_context(device))
File "/data/anaconda/envs/equibind/lib/python3.7/site-packages/dgl/utils/internal.py", line 533, in to_dgl_context
device_id = F.device_id(ctx)
File "/data/anaconda/envs/equibind/lib/python3.7/site-packages/dgl/backend/pytorch/tensor.py", line 90, in device_id
return 0 if ctx.type == 'cpu' else th.cuda.current_device()
File "/data/anaconda/envs/equibind/lib/python3.7/site-packages/torch/cuda/__init__.py", line 479, in current_device
_lazy_init()
File "/data/anaconda/envs/equibind/lib/python3.7/site-packages/torch/cuda/__init__.py", line 208, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

How can I solve this error?

Scale off for results

The program runs without issue, but the scale of the SDF molecule does not match the input scale/protein.
Have you seen this before? Is there any recourse here?


dgl issue

Dear all,

I created the given environment on a Windows computer and am experiencing a dgl error.

(equibind) PS C:\Users\reps\EquiBind> python inference.py --config=configs_clean/inference_file_for_reproduce.yml

Using backend: pytorch
[2022-02-13 09:04:14.926225] [ Using Seed : 1 ]
Traceback (most recent call last):
File "inference.py", line 459, in <module>
inference(args)
File "inference.py", line 121, in inference
seed_all(args.seed)
File "C:\Users\reps\EquiBind\commons\utils.py", line 62, in seed_all
dgl.random.seed(seed)
File "C:\Users.conda\envs\equibind\lib\site-packages\dgl\random.py", line 18, in seed
_CAPI_SetSeed(val)
File "C:\Users.conda\envs\equibind\lib\site-packages\dgl\_ffi\_ctypes\function.py", line 190, in __call__
ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
File "C:\Users.conda\envs\equibind\lib\site-packages\dgl\_ffi\base.py", line 64, in check_call
raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: [09:04:14] C:\Users\Administrator\dgl-0.5\src\random\random.cc:34: Check failed: e == CURAND_STATUS_SUCCESS: CURAND Error: CURAND_STATUS_INITIALIZATION_FAILED at C:\Users\Administrator\dgl-0.5\src\random\random.cc:34

Question regarding DGL

Hi Hannes,

Thank you for this really cool project, as well as contributions to this area of computational drug discovery.

I actually have two questions but I split them into two issues so they can be addressed separately.

This first question is regarding DGL: is it a graph deep learning library analogous to TorchGeometric? Is it being used as a substitute for TorchGeometric?

Also, I am running into this error
/opt/dgl/src/runtime/tensordispatch.cc:43: TensorDispatcher: dlopen failed: /home/joshua-talo/.conda/envs/equibind/lib/python3.8/site-packages/dgl/tensoradapter/pytorch/libtensoradapter_pytorch_1.10.2.so: cannot open shared object file: No such file or directory

Is there some dependency mismatch that I might be missing?

Thank you.

Regards,
Joshua

Baseline details, global docking configuration, GLIDE.

Hello,
In the paper, you mention the GLIDE runtime, but we couldn't find the settings for GLIDE and the other baselines.
Since those methods typically perform local docking, which requires specifying a pocket center and box size, it would be great if you could provide more details on performing global docking with GLIDE and the others.
Thanks!

Questions about reproducing

Hi,
When I run the command python train.py --config=configs_clean/RDKitCoords_flexible_self_docking.yml, the following error is always reported after part of the receptor data has been processed. But when I reduce the training dataset to one tenth of the size used in the paper, it works normally. I am very confused about this. Do you know where the problem is?

joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/.local/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 407, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/.conda/envs/equidock/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
RuntimeError: invalid value in pickle
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/EquiBind/train.py", line 304, in <module>
    main_function()
  File "/EquiBind/train.py", line 295, in main_function
    train_wrapper(args)
  File "/EquiBind/train.py", line 141, in train_wrapper
    return train(args, run_dir)
  File "/EquiBind/train.py", line 169, in train
    train_data = PDBBind(device=device, complex_names_path=args.train_names,lig_predictions_name=args.train_predictions_name, is_train_data=True, **args.dataset_params)
  File "/EquiBind/datasets/pdbbind.py", line 127, in __init__
    self.process()
  File "/EquiBind/datasets/pdbbind.py", line 252, in process
    receptor_representatives = pmap_multi(get_receptor, zip(rec_paths, ligs), n_jobs=self.n_jobs, cutoff=self.chain_radius, desc='Get receptors')
  File "/EquiBind/commons/utils.py", line 46, in pmap_multi
    results = Parallel(n_jobs=n_jobs, verbose=verbose, timeout=None)(
  File "/.local/lib/python3.9/site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/.local/lib/python3.9/site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/.local/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/.conda/envs/equidock/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/.conda/envs/equidock/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/.local/lib/python3.9/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "/.local/lib/python3.9/site-packages/joblib/parallel.py", line 359, in __call__
    self.parallel.dispatch_next()
  File "/.local/lib/python3.9/site-packages/joblib/parallel.py", line 794, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "/.local/lib/python3.9/site-packages/joblib/parallel.py", line 861, in dispatch_one_batch
    self._dispatch(tasks)
  File "/.local/lib/python3.9/site-packages/joblib/parallel.py", line 779, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/.local/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 531, in apply_async
    future = self._workers.submit(SafeFunction(func))
  File "/.local/lib/python3.9/site-packages/joblib/externals/loky/reusable_executor.py", line 177, in submit
    return super(_ReusablePoolExecutor, self).submit(
  File "/.local/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 1115, in submit
    raise self._flags.broken
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

Anyway, thank you very much for your help. EquiBind is a great work!

Calculating corrected intersection losses at inference

Hi,

When performing inference (from files), the untuned losses are calculated and reported in all_intersection_losses_untuned. However, the optimized losses never seem to be calculated and reported.

In inference.py, can we use something along the lines of:

all_intersection_losses_untuned.append(
                    compute_revised_intersection_loss(lig_coords_pred_untuned.detach().cpu(), rec_graph.ndata['x'],
                                                        alpha=0.2, beta=8, aggression=0))
# After calculating coords_pred_optimized:
all_intersection_losses.append(compute_revised_intersection_loss(coords_pred_optimized, rec_graph.ndata['x'], alpha=0.2, beta=8, aggression=0))

Many thanks!

Support for multi-ligand SDF files

If I supply a ligand SDF file as input, the output SDF file has only the first ligand from the input. Is this to be expected? If my goal is to dock a bunch of ligands, is the best thing to provide multiple name1_ligand.sdf files?
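
One way to obtain a separate ligand file per molecule is to split the multi-molecule file first. The sketch below relies only on the fact that each record in an SDF file ends with a line containing `$$$$`; `split_sdf` is a hypothetical helper, not part of EquiBind, and it does not validate the chemistry of the records.

```python
from typing import List


def split_sdf(text: str) -> List[str]:
    """Split the text of an SDF file into one string per molecule record."""
    records = []
    current = []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":  # record terminator
            records.append("\n".join(current) + "\n")
            current = []
    # Keep a trailing record that lacks a terminator, if it has any content.
    if any(l.strip() for l in current):
        records.append("\n".join(current) + "\n")
    return records
```

Each returned record could then be written to its own name_ligand.sdf file inside a per-complex directory, following the layout described in the README.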

142 instead of 144 new receptors.

Hello,
Great work.
Could you please provide the PDB list of your 144 new receptors?
Following your train/valid split and removing duplicated UniProt IDs, I got 142 instead of 144 PDBs.
I'm not sure how this difference arises.

FileNotFoundError: train_arguments.yaml

Hi, I am attempting to run this software on an Ubuntu virtual machine. Setting up the environment went smoothly. However, when I try to run the inference.py script I get the following error:

Traceback (most recent call last):
  File "/home/tony/Documents/EquiBind-main/inference.py", line 460, in <module>
    with open(os.path.join(os.path.dirname(args.checkpoint), 'train_arguments.yaml'), 'r') as arg_file:
FileNotFoundError: [Errno 2] No such file or directory: 'runs/flexible_self_docking/train_arguments.yaml'

It seems like the concatenation of the path for the train_arguments.yaml is not working correctly, hopefully this is quite an easy fix though?

Thanks in advance for your help.

Support for multiple random RDKit conformations for each ligand

lig_graph = deepcopy(self.lig_graphs[idx][self.conformer_id])

Hi! I was wondering if there is currently support for using multiple conformations per ligand in the training process. In the above snippet, it looks like there is, but self.conformer_id is fixed, so only the first of a mol's conformations will be used. If I wanted to add support for multiple conformations, could I just change self.conformer_id to be a random sample of size num_confs, or would this break other things in the pipeline?
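
The idea in the question can be sketched in isolation as follows. This is a toy stand-in, not EquiBind's dataset class: `ConformerSampler` and its `get` method are hypothetical, and whether the rest of the pipeline tolerates a changing conformer index is exactly the open question above.

```python
import random
from copy import deepcopy


class ConformerSampler:
    """Toy stand-in for a dataset that stores several conformers per ligand."""

    def __init__(self, lig_graphs, seed=None):
        # lig_graphs[idx] is a list holding one entry per conformer.
        self.lig_graphs = lig_graphs
        self.rng = random.Random(seed)

    def get(self, idx):
        """Return a copy of one randomly chosen conformer of ligand `idx`."""
        conformers = self.lig_graphs[idx]
        conformer_id = self.rng.randrange(len(conformers))
        return deepcopy(conformers[conformer_id])
```

Drawing the index on every access, rather than fixing it once, means each epoch can see a different conformer of the same ligand.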

Not able to run with python 2. DGL-cuda without CUDA GPU.

python inference.py --config=configs_clean/inference.yml
File "inference.py", line 119
sys.stdout = Logger(logpath=os.path.join(os.path.dirname(args.checkpoint), f'inference.log'), syspart=sys.stdout)
^
SyntaxError: invalid syntax
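
The line flagged in the error uses an f-string, which exists only in Python 3.6 and later, so a Python 2 interpreter rejects the file while parsing it. A version guard can make the failure easier to read. The sketch below is an assumption about how such a guard might look, not code from EquiBind; `require_python3` is a hypothetical name, and it avoids f-strings so that old interpreters can parse it.

```python
import sys


def require_python3(version_info=sys.version_info, minimum=(3, 6)):
    """Raise a readable error when the interpreter is too old for f-strings."""
    found = tuple(version_info[:2])
    if found < minimum:
        # %-formatting is used on purpose: it also parses under Python 2.
        raise RuntimeError(
            "Python %d.%d or newer is required, found %d.%d" % (minimum + found)
        )
```

Because the SyntaxError is raised at parse time, a guard like this only helps if it lives in a separate launcher module that is imported before any file containing f-strings.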

Potential improvement through avoiding DGL TensorDispatcher error

I encountered this error when running Equibind.

/opt/dgl/src/runtime/tensordispatch.cc:43: TensorDispatcher: dlopen failed: libtorch_cuda.so: cannot open shared object file: No such file or directory

According to this https://discuss.dgl.ai/t/error-tensordispatcher/2468 the error impacts performance but does not impact usage. Equibind appears to run in spite of this error. I thought you might want to take a look.

Ubuntu 18.04.6 LTS
Installed according to the readme with conda env create -f environment.yml

Demo run failed

Dear author,

I just installed EquiBind following the tutorial, but I failed to run the demo, could you help?

python inference.py --config=configs_clean/inference.yml

[2022-03-09 18:32:58.749212] [ Using Seed :  1  ]
Traceback (most recent call last):
  File "inference.py", line 473, in <module>
    inference_from_files(args)
  File "inference.py", line 306, in inference_from_files
    seed_all(args.seed)
  File "/home/conda/EquiBind-main/commons/utils.py", line 62, in seed_all
    dgl.random.seed(seed)
  File "/tools/anaconda3/envs/equibind/lib/python3.7/site-packages/dgl/random.py", line 18, in seed
    _CAPI_SetSeed(val)
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 222, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 211, in dgl._ffi._cy3.core.FuncCall3
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [18:32:58] /opt/dgl/src/random/random.cc:34: Check failed: e == CURAND_STATUS_SUCCESS: CURAND Error: CURAND_STATUS_INITIALIZATION_FAILED at /opt/dgl/src/random/random.cc:34

Best regards
Zhenting

reproduction of paper results and ligand/protein prep

Hi --

Thanks for the fantastic effort. I would like to reproduce the results you got for the structures found in figure 14, but the PDB ID's are not available in the manuscript.

I tried to reproduce the equibind pose for Imatinib in the manuscript, but after several attempts I can get nothing close to the manuscript. My suspicion is that my preparation of the PDB file to isolated ligand and protein may be different than what you are doing.

I made a modification of the Colab notebook found here:
https://twitter.com/pablitoarantes/status/1548371667600101377?s=20&t=wXP5A7Qavf7nNykJw3gFHA

My modification includes the use of reduce, so I assume there shouldn't be much difference between what you are doing and I am doing:

https://github.com/abazabaaa/colab_tutorial/blob/main/equibind_imatinib.ipynb

** I realized I put the wrong pdb id in here to test things. That being said, for the sake of clarity, the questions below would be helpful for debugging and making sure I am running your code correctly.

Questions:

  1. Can you provide the PDB id's from figure 14? It would really help ensure I am setting things up correctly.

  2. Can you provide the code that was used to isolate ligands and proteins from raw PDB files downloaded from the PDB? I have a feeling there may be some details in there that are important and I am missing them (I would guess you strip ions/co-factors, but I didn't see that in the paper text).

Questions about SE(3)-equivariant

Hi,

I'm still a newbie in ligand-receptor binding. I ran an experiment on translation equivariance using the 5ol3 complex in the PDBBind dataset. I translated both the ligand and the receptor of the 5ol3 complex by 1 Å along the y-axis, and separately translated the ligand only. However, the predicted binding poses differed between no shift (the control group), shifting the ligand, and shifting both the ligand and the receptor. The figure below shows the conformations inferred by the model. It does not look like the model guarantees SE(3)-equivariance. I would like to understand why these three conformations are not similar.

Many Thanks!
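
The experiment described above amounts to checking whether dock(ligand + t, receptor + t) equals dock(ligand, receptor) + t. The sketch below shows that check on a toy docking function that merely moves the ligand centroid onto the receptor centroid. Everything here is hypothetical (`toy_dock`, `translation_equivariance_error`); it does not model EquiBind or its preprocessing, such as random seeds or conformer generation, which may explain differences seen in practice.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
Cloud = List[Point]


def translate(cloud: Cloud, t: Point) -> Cloud:
    """Shift every point of the cloud by the vector t."""
    return [(x + t[0], y + t[1], z + t[2]) for x, y, z in cloud]


def centroid(cloud: Cloud) -> Point:
    n = len(cloud)
    return (sum(p[0] for p in cloud) / n,
            sum(p[1] for p in cloud) / n,
            sum(p[2] for p in cloud) / n)


def toy_dock(ligand: Cloud, receptor: Cloud) -> Cloud:
    """Toy model: move the ligand so its centroid sits on the receptor centroid."""
    lc, rc = centroid(ligand), centroid(receptor)
    return translate(ligand, (rc[0] - lc[0], rc[1] - lc[1], rc[2] - lc[2]))


def max_deviation(a: Cloud, b: Cloud) -> float:
    """Largest absolute coordinate difference between two equally sized clouds."""
    return max(abs(u - v) for p, q in zip(a, b) for u, v in zip(p, q))


def translation_equivariance_error(dock, ligand: Cloud, receptor: Cloud, t: Point) -> float:
    """Compare dock(ligand + t, receptor + t) with dock(ligand, receptor) + t."""
    moved = dock(translate(ligand, t), translate(receptor, t))
    expected = translate(dock(ligand, receptor), t)
    return max_deviation(moved, expected)
```

For this toy function the error is zero up to floating-point rounding, and shifting only the ligand leaves the output unchanged. A real model can be plugged in as `dock` to run the same comparison.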

[Figure: 5ol3 ligand poses inferred by EquiBind]

joblib Parallel issue with specific complex '3m1s' in the PDBBind data

It could be a problem with RDKit. My version is "rdkit 2021.09.4".

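import os
import pickle

# read_molecule is an EquiBind helper (assumed to live in commons/process_mols.py); run from the repository root.
from commons.process_mols import read_molecule

pdbbind_dir = "PDBBind_processed/"
name = '3m1s'
lig = read_molecule(os.path.join(pdbbind_dir, name, f'{name}_ligand.sdf'), sanitize=True,
                    remove_hs=True)
if lig is None:  # read mol2 file if sdf file cannot be sanitized
    lig = read_molecule(os.path.join(pdbbind_dir, name, f'{name}_ligand.mol2'), sanitize=True,
                        remove_hs=True)
# lig = Chem.MolFromSmiles('O=C[Ru+9]12345(C6=C1C2C3=C64)n1c2ccc(O)cc2c2c3c(c4ccc[n+]5c4c21)C(=O)NC3=O')
# Round-trip the molecule through pickle to reproduce the error below.
with open("test.pkl", "wb") as f:
    pickle.dump(lig, f)
with open("test.pkl", "rb") as f:
    pickle.load(f)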

RuntimeError: invalid value in pickle

c

Hi, How do you guys interpret the results? the output.sdf is empty. The run was successfully done. Thank you!


About hydrogen removal

Thanks for your great contribution! I want to reproduce your paper's results. The paper says all the metrics are calculated after removing hydrogens, but in the training config remove_h is false. Are the final results reproduced by setting this to true?

ValueError: list.remove(x): x not in list in process_mols.py line 692, in get_geometry_graph

I get the following error for a specific ligand-protein pair:

Trying to load ligand_106.sdf
Docking the receptor protein_A.pdb
To the ligand ligand_106.sdf
Traceback (most recent call last):
  File "inference.py", line 473, in <module>
    inference_from_files(args)
  File "inference.py", line 350, in inference_from_files
    geometry_graph = get_geometry_graph(lig)
  File "/home/moritz/Projects/EquiBind/commons/process_mols.py", line 692, in get_geometry_graph
    all_dst_idx.remove(src_idx)
ValueError: list.remove(x): x not in list

(I followed the instruction in the README and ran python inference.py --config=configs_clean/inference.yml)

Error when run get_receptor_inference

First of all, your research has had a huge impact on AI-based drug discovery. Thank you so much!!!

I got the error below.
My inputs are a protein PDB from https://www.rcsb.org/ and a 3D ligand conformer from PubChem.

[2022-04-06 19:21:16.672575] [ Using Seed :  1  ]

Processing SOS1: complex 1 of 1
Trying to load data/my_data_folder/SOS1/SOS1_ligand.sdf
Docking the receptor data/my_data_folder/SOS1/SOS1_protein.pdb
To the ligand data/my_data_folder/SOS1/SOS1_ligand.sdf
Traceback (most recent call last):
  File "inference.py", line 473, in <module>
    inference_from_files(args)
  File "inference.py", line 340, in inference_from_files
    rec, rec_coords, c_alpha_coords, n_coords, c_coords = get_receptor_inference(rec_path)
  File "/home/sejeong/codes/EquiBind/commons/process_mols.py", line 421, in get_receptor_inference
    c_alpha_coords = np.concatenate(valid_c_alpha_coords, axis=0)  # [n_residues, 3]
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: need at least one array to concatenate

Is there any problem in my input setting?
I would be very grateful if you could give me a solution.

Thank you for your work again. :)

Could you provide QuickVina2 and Smina finetuning script?

To my understanding, Smina first places the ligand randomly in the pocket and then performs a Monte Carlo search based on an energy function. In other words, the result of EquiBind+S should be the same as that of Smina, but the paper reports a huge improvement from 11.6 to 19.6.

Could you provide QuickVina2 and Smina finetuning script?
I am curious.


My guess is below.
Your experimental pipeline probably runs the rigid_redocking model first and then uses Smina for fine-tuning. In this pipeline, the ligand's initial conformer may be exactly the one in the co-crystal ligand pose (rotatable bonds are correct), whereas the Smina result (11.6) uses only an RDKit conformer (rotatable bonds may not be correct), which would cause the different results.
Different rotatable-bond settings of the ligand cause very different docking results with traditional docking software.

Hydrogens on receptors

Hi Hannes, thanks for the amazing tool :) In the paper, you note that you add hydrogens to the receptors using reduce. However, my understanding is that EquiBind uses a graph of the residues to represent the protein. What is the purpose of adding hydrogens to the receptor, and (more importantly), should I be doing the same when running inference? Thanks again

On an unrelated note, I saw a request for support for multi-ligand files. I've created my own implementation for that exact purpose, which is able to handle large sdf and smiles files lazily, running inference as the file is processed and writing the results along the way without needing to load the entire file at once. Would you be interested in a pull request?

EquiBind result analysis

After EquiBind successfully generates a docked pose for a ligand and a receptor, can we further investigate the binding site? For example, can we see which specific types of interactions are made between the ligand and the receptor? The task would be to identify the residues in the primary structure that make those interactions possible.

psutil error

I got the following error. Any idea?
No module named 'psutil'

Bad Clashes

I was able to get EquiBind running in both the GPU and CPU versions; however, I get pretty bad clashes using the default model. The models generally look like they could be plausible as a centroid, however. Do you have any advice on how best to remove these clashes? In the paper, you mention EquiBind+S. Is this done as two separate tasks, or do you integrate it somehow? Is this generally what you would recommend, or are there other options you would recommend trying, especially for bigger ligands such as this molecule (a peptide of about 180 atoms including hydrogens)?

Finally, if the general recommendation is Equibind->SMINA, which it seems to be, do you have cmd-line arguments used in your benchmarking for SMINA? I could not find a paper supplement with these arguments.

An example of a bad clash in the ligand was attached as a screenshot.

run within a certain box

Could you please clarify whether the current version can perform this procedure within a certain box (rather than in the blind scenario)?

Thank you!

Measure of inference quality

Hi,

First of all, thanks for developing this method - it is really something new.

When using the inference_from_files mode to infer various ligand conformations on several protein targets, how does the tool report some measure of inference quality? Do you have some measure of how well a conformation fits the target?
I can see intersection_losses_untuned being reported. Can it be used?

Many thanks!

Corrected SDF files positions appear to be incorrect

Hello - I am running EquiBind to calculate binding locations for a couple of hundred ligand SDF files. The process completes without errors and the outputted _corrected.sdf files look just like the original ligand SDF with modified x,y,z coordinates. However, when I plot the protein PDB and all the corrected ligands, the ligands all fall inside of the protein structure, rather than being distributed and connected to the protein strand at different locations. This clustering of the ligands at the geometric center of the protein structure does not make sense to me intuitively, so I am trying to figure out what is going on. Any thoughts on why EquiBind gives me coordinates that bulk up at the center of the protein structure?

[Question] Graph representations

Hi, really interesting paper. Could you please clarify something for me: when creating edges for the ligand graph, do you use both k-NN and distance threshold? Or do you use only the distance threshold for the ligand, and k-NN for the protein? Thanks!
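
For readers unfamiliar with the two schemes named in the question, the sketch below contrasts a distance-threshold graph, a k-NN graph, and a combination of both on toy coordinates. It is an illustration of the general techniques only; the function names, the cutoff, and k are assumptions, and nothing here states which scheme EquiBind uses for the ligand or the protein.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]
Edge = Tuple[int, int]  # (source node, destination node)


def _dist(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def radius_edges(coords: Sequence[Point], cutoff: float) -> List[Edge]:
    """Connect every pair of distinct nodes closer than `cutoff`."""
    n = len(coords)
    return [(j, i) for i in range(n) for j in range(n)
            if i != j and _dist(coords[i], coords[j]) < cutoff]


def knn_edges(coords: Sequence[Point], k: int) -> List[Edge]:
    """Connect each node to its k nearest neighbours, however far away they are."""
    edges: List[Edge] = []
    for i, p in enumerate(coords):
        others = sorted((j for j in range(len(coords)) if j != i),
                        key=lambda j: _dist(p, coords[j]))
        edges.extend((j, i) for j in others[:k])
    return edges


def radius_capped_edges(coords: Sequence[Point], cutoff: float, k: int) -> List[Edge]:
    """Combine both: at most k neighbours per node, all within `cutoff`."""
    edges: List[Edge] = []
    for i, p in enumerate(coords):
        near = sorted((j for j in range(len(coords))
                       if j != i and _dist(p, coords[j]) < cutoff),
                      key=lambda j: _dist(p, coords[j]))
        edges.extend((j, i) for j in near[:k])
    return edges


# Toy coordinates: three nodes close together and one far away.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
by_radius = radius_edges(coords, cutoff=1.5)
by_knn = knn_edges(coords, k=1)
combined = radius_capped_edges(coords, cutoff=1.5, k=1)
print(by_radius, by_knn, combined)
```

The practical difference shows up at the distant node: the threshold graph leaves it isolated, while k-NN always gives it an incoming edge.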

Can't kekulize mol. Unkekulized atoms

I got a lot of the following errors. Any idea what went wrong? Thanks.

[15:06:44] Can't kekulize mol. Unkekulized atoms: 14 15 16 17 18
[15:06:44] Can't kekulize mol. Unkekulized atoms: 8 9 10 11 12 13 14 15 16
[15:06:44] Can't kekulize mol. Unkekulized atoms: 40 41 42 43 44
....

update: Reinstall of psutil fixed the problem.

invalid value in pickle

loading ligands: 100%|##########| 16379/16379 [00:07<00:00, 2120.18it/s]
[2022-06-20 17:45:06.850970] Get receptors, filter chains, and get its coordinates
Get receptors: 0it [00:00, ?it/s][Parallel(n_jobs=20)]: Using backend LokyBackend with 20 concurrent workers.
Get receptors: 40it [00:02, 15.94it/s][Parallel(n_jobs=20)]: Done 10 tasks | elapsed: 3.3s
Get receptors: 180it [00:18, 9.37it/s][Parallel(n_jobs=20)]: Done 160 tasks | elapsed: 19.3s
Get receptors: 440it [00:48, 7.08it/s][Parallel(n_jobs=20)]: Done 410 tasks | elapsed: 49.5s
Get receptors: 780it [01:33, 6.45it/s][Parallel(n_jobs=20)]: Done 760 tasks | elapsed: 1.6min
Get receptors: 1240it [02:34, 6.44it/s][Parallel(n_jobs=20)]: Done 1210 tasks | elapsed: 2.6min
Get receptors: 1780it [03:40, 9.20it/s][Parallel(n_jobs=20)]: Done 1760 tasks | elapsed: 3.7min
Get receptors: 2440it [05:06, 11.94it/s][Parallel(n_jobs=20)]: Done 2410 tasks | elapsed: 5.1min
Get receptors: 3180it [06:55, 3.39it/s][Parallel(n_jobs=20)]: Done 3160 tasks | elapsed: 6.9min
Get receptors: 4040it [08:31, 11.87it/s][Parallel(n_jobs=20)]: Done 4010 tasks | elapsed: 8.5min
Get receptors: 4980it [10:24, 13.37it/s][Parallel(n_jobs=20)]: Done 4960 tasks | elapsed: 10.4min
Get receptors: 5280it [11:34, 12.19it/s]exception calling callback for <Future at 0x7f9e7b119c50 state=finished raised BrokenProcessPool>
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 407, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
File "/opt/miniconda3/envs/equibind/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
RuntimeError: invalid value in pickle
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
callback(self)
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/parallel.py", line 359, in __call__
self.parallel.dispatch_next()
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/parallel.py", line 794, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator):
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/parallel.py", line 861, in dispatch_one_batch
self._dispatch(tasks)
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/parallel.py", line 779, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 531, in apply_async
future = self._workers.submit(SafeFunction(func))
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/externals/loky/reusable_executor.py", line 178, in submit
fn, *args, **kwargs)
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 1115, in submit
raise self._flags.broken
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
Get receptors: 5300it [11:35, 14.89it/s]joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 407, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
File "/opt/miniconda3/envs/equibind/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
RuntimeError: invalid value in pickle
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "train.py", line 304, in <module>
main_function()
File "train.py", line 295, in main_function
train_wrapper(args)
File "train.py", line 141, in train_wrapper
return train(args, run_dir)
File "train.py", line 169, in train
train_data = PDBBind(device=device, complex_names_path=args.train_names,lig_predictions_name=args.train_predictions_name, is_train_data=True, **args.dataset_params)
File "/root/code/EquiBind/datasets/pdbbind.py", line 127, in __init__
self.process()
File "/root/code/EquiBind/datasets/pdbbind.py", line 254, in process
receptor_representatives = pmap_multi(get_receptor, zip(rec_paths, ligs), n_jobs=self.n_jobs, cutoff=self.chain_radius, desc='Get receptors')
File "/root/code/EquiBind/commons/utils.py", line 47, in pmap_multi
delayed(pickleable_fn)(*d, **kwargs) for i, d in tqdm(enumerate(data),desc=desc)
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/parallel.py", line 1056, in __call__
self.retrieve()
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/parallel.py", line 935, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/opt/miniconda3/envs/equibind/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/opt/miniconda3/envs/equibind/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
callback(self)
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/parallel.py", line 359, in __call__
self.parallel.dispatch_next()
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/parallel.py", line 794, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator):
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/parallel.py", line 861, in dispatch_one_batch
self._dispatch(tasks)
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/parallel.py", line 779, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 531, in apply_async
future = self._workers.submit(SafeFunction(func))
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/externals/loky/reusable_executor.py", line 178, in submit
fn, *args, **kwargs)
File "/opt/miniconda3/envs/equibind/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 1115, in submit
raise self._flags.broken
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
Get receptors: 5339it [13:02, 6.82it/s]

Duplicate of Issue #13 + The `model_type` parameter in .yml config files

Dear authors,
I have just started working with your tool. I installed the CPU version (conda env create -f environment_cpuonly.yml) and ran it successfully. I used the 5tgz PDB structure as the target and docked 1743 known inhibitors.
Versions:

rdkit                     2021.09.5
openbabel                 3.1.1

I ran it with this command:

 python docking/EquiBind/inference.py --config=docking/equibind_run/inference.yml

inference.yml:

run_dirs:
  - flexible_self_docking # the resulting coordinates will be saved here as tensors in a .pt file (but also as .sdf files if you specify an "output_directory" below)
inference_path: 'docking/equibind_run' # this should be your input file path as described in the main readme

test_names: timesplit_test
output_directory: 'docking/equibind_run/output' # the predicted ligands will be saved as .sdf file here
run_corrections: True
use_rdkit_coords: False # generates the coordinates of the ligand with rdkit instead of using the provided conformer. If you already have a 3D structure that you want to use as initial conformer, then lea$
save_trajectories: False

num_confs: 1 # usually this should be 1
seed: 120
device: cpu

Initial conformers were generated with RDKit (params = AllChem.ETKDGv3(); AllChem.EmbedMolecule(mol, params)).
All missing hydrogens were added to the ligands (with RDKit) and to the protein structure (with Chimera).
As input I used SDF files for the ligands and a PDB file for the protein (each ligand, together with the protein, was placed in its own directory).
Example of the input files:
CHEMBL1088245_protein.pdb.txt
CHEMBL1088245_ligand.sdf.txt

The problem is that the resulting binding poses are incorrect: the ligand's atoms overlap the protein's atoms.
lig_equibind_corrected.sdf.txt
Maybe I didn't set some special parameters?
Could you help me please?
Thank you!
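
For a screen like the one described above (one protein, many ligands), the per-complex folder layout from the README can be generated with a short script. This is a minimal sketch: build_inference_folder is a hypothetical helper, not part of EquiBind, and the file names in the demo are placeholders written to a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List


def build_inference_folder(protein_pdb: Path, ligand_files: Iterable[Path],
                           out_root: Path) -> List[Path]:
    """Create <out_root>/<name>/<name>_protein.pdb and <name>_ligand.<ext> per ligand.

    Hypothetical helper; it follows the naming rule from the README
    (file names must contain 'protein' and 'ligand').
    """
    out_root.mkdir(parents=True, exist_ok=True)
    created = []
    for lig in ligand_files:
        lig = Path(lig)
        name = lig.stem  # the ligand file name without its extension
        complex_dir = out_root / name
        complex_dir.mkdir(exist_ok=True)
        shutil.copy(protein_pdb, complex_dir / f"{name}_protein.pdb")
        shutil.copy(lig, complex_dir / f"{name}_ligand{lig.suffix}")
        created.append(complex_dir)
    return created


# Demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    protein = tmp_path / "5tgz.pdb"
    protein.write_text("END\n")
    ligand = tmp_path / "CHEMBL1088245.sdf"
    ligand.write_text("$$$$\n")
    dirs = build_inference_folder(protein, [ligand], tmp_path / "my_data_folder")
    layout = sorted(p.name for p in dirs[0].iterdir())

print(layout)
```

The resulting root folder is what inference_path in inference.yml would point to.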

Problem with pdb file input

I preprocessed the PDB file as suggested in the arXiv paper. When I try to run the prediction, I get an error:

Traceback (most recent call last):
File "inference.py", line 471, in <module>
inference_from_files(args)
File "inference.py", line 339, in inference_from_files
rec, rec_coords, c_alpha_coords, n_coords, c_coords = get_receptor(rec_path, lig, cutoff=dp['chain_radius'])
File "E:\EquiBind-main\commons\process_mols.py", line 373, in get_receptor
c_alpha_coords = np.concatenate(valid_c_alpha_coords, axis=0) # [n_residues, 3]
File "<__array_function__ internals>", line 6, in concatenate
ValueError: need at least one array to concatenate
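
The traceback shows that the list of valid C-alpha coordinates was empty when get_receptor tried to concatenate it. One thing worth ruling out, as an assumption rather than a confirmed diagnosis, is that preprocessing left no residue with a complete N, CA, C backbone in ATOM records. The sketch below counts such residues using the fixed-column PDB format; residues_with_backbone and the sample lines are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

BACKBONE = {"N", "CA", "C"}


def residues_with_backbone(pdb_text: str) -> Tuple[int, int]:
    """Return (residues with N, CA and C present, total residues) from ATOM records."""
    atoms: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            atom_name = line[12:16].strip()          # columns 13-16
            key = (line[21], line[22:27].strip())    # chain ID, residue number + insertion code
            atoms[key].add(atom_name)
    complete = sum(1 for names in atoms.values() if BACKBONE <= names)
    return complete, len(atoms)


# Toy input (hypothetical coordinates): ALA 1 is complete, GLY 2 lacks its C atom.
SAMPLE_PDB = "\n".join([
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  ALA A   1      11.639   6.071  -5.147  1.00  0.00           C",
    "ATOM      3  C   ALA A   1      10.500   6.000  -4.100  1.00  0.00           C",
    "ATOM      4  N   GLY A   2       9.300   5.900  -4.600  1.00  0.00           N",
    "ATOM      5  CA  GLY A   2       8.100   5.800  -3.800  1.00  0.00           C",
])

complete, total = residues_with_backbone(SAMPLE_PDB)
print(complete, total)
```

If the first number is 0 for the real input file, the preprocessing step is the likely place to look.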

Only ligand sdf file is generated

Hi,

I might not have understood everything correctly, but I expected the output to be the ligand bound to the receptor. However, I only get the ligand, and it slightly clashes with the receptor and with itself. Is this a normal prediction?

Can't Kekulize mol

Hello,

Really cool tool that I hope to be able to use! I'm getting this error for an SDF file that is hydrated and saved from PyMol. Any advice here?

[14:37:23] Can't kekulize mol.  Unkekulized atoms: 47 48 49 50 51 52 53 54 56
Traceback (most recent call last):
  File "/home/jadolfbr/EquiBind/inference.py", line 473, in <module>
    inference_from_files(args)
  File "/home/jadolfbr/EquiBind/inference.py", line 337, in inference_from_files
    if lig == None: raise ValueError(f'None of the ligand files could be read: {lig_names}')
ValueError: None of the ligand files could be read: ['215_ligand.sdf']

Memory requirements for training?

I've been trying to train the model using the RDKitCoords_flexible_self_docking.yml configuration, and after modifying the code based on #6, I am having a RAM overflow problem.

It seems like the code loads the receptors into memory. Is this supposed to happen and to take so much memory? If so, how much RAM is recommended to run this project? Thanks!

Question about the training phase (total required training time, etc.)

Very impressive work and clean code!
I just tried to retrain the whole model with the training settings you recommend (full train and val sets). However, each epoch takes at least 5 minutes (batch size 32, on an A100 GPU), and I was shocked to see that the total number of epochs is 100k.
So, may I ask how long you trained the whole model and how many resources are needed?

Best,
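
For scale, the two figures quoted in this post imply the following upper bound, assuming every configured epoch is actually run at the reported speed. Whether training normally stops earlier is not stated here.

```python
# Figures taken from the post above.
minutes_per_epoch = 5
max_epochs = 100_000

total_minutes = minutes_per_epoch * max_epochs
total_hours = total_minutes / 60
total_days = total_hours / 24

print(total_minutes, round(total_hours, 1), round(total_days, 1))
```

That is roughly 347 days on a single A100, which is why the configured epoch count looks alarming.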

CUDA error

Hi
I have tried to install EquiBind and create environments for both CUDA-enabled and CPU-only machines using environment.yml and environment_cpu.yml. After activating the equibind environment, I am trying to run a test example using the following command:

python multiligand_inference.py -o ./test_output/ -r ./test_input/protein.pdb -l ./test_input/ligand.sdf

However, I am getting the same error for both the CUDA and CPU-only installations:

device = cpu
Entering batch ending in index 8/18
Traceback (most recent call last):
File "multiligand_inference.py", line 275, in <module>
main()
File "multiligand_inference.py", line 272, in main
write_while_inferring(lig_loader, model, args)
File "multiligand_inference.py", line 216, in write_while_inferring
lig_graphs = lig_graphs.to(args.device)
File "/Users/mshekhar/miniconda3/envs/equibind/lib/python3.7/site-packages/dgl/heterograph.py", line 5448, in to
ret._graph = self._graph.copy_to(utils.to_dgl_context(device))
File "/Users/mshekhar/miniconda3/envs/equibind/lib/python3.7/site-packages/dgl/utils/internal.py", line 533, in to_dgl_context
device_id = F.device_id(ctx)
File "/Users/mshekhar/miniconda3/envs/equibind/lib/python3.7/site-packages/dgl/backend/pytorch/tensor.py", line 90, in device_id
return 0 if ctx.type == 'cpu' else th.cuda.current_device()
File "/Users/mshekhar/miniconda3/envs/equibind/lib/python3.7/site-packages/torch/cuda/__init__.py", line 479, in current_device
_lazy_init()
File "/Users/mshekhar/miniconda3/envs/equibind/lib/python3.7/site-packages/torch/cuda/__init__.py", line 208, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Please help
Regards
Mrinal
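
The last DGL frame in the traceback only calls th.cuda.current_device() when the target device is not 'cpu', which suggests that the graphs were being moved to a CUDA device on a PyTorch build without CUDA, despite the "device = cpu" line in the log. A defensive pattern is to resolve the device once and fall back to the CPU. This is a sketch: resolve_device and cuda_is_available are hypothetical helpers, and where to apply the result inside multiligand_inference.py is left open.

```python
def cuda_is_available() -> bool:
    """True only if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def resolve_device(requested: str, cuda_ok: bool) -> str:
    """Return the requested device, falling back to 'cpu' when CUDA is unusable."""
    if requested.startswith("cuda") and not cuda_ok:
        return "cpu"
    return requested


# Ask for a GPU but accept the CPU if this build cannot use CUDA.
device = resolve_device("cuda:0", cuda_is_available())
print(device)
```

The resolved string can then be passed wherever a device is expected, for example to a graph's .to(...) call.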

Paper Typo

Hi, thank you for this really nice paper. This is just a quick note to say there is a typo in the current arXiv version. In this sentence on the final page, "weather" should be replaced by "whether."

image

Is it possible to predict torsion angles directly from the model?

I reviewed the GeoMol paper, which is cited in Section 3.2.2 of the paper.
That model can predict torsion angles directly. EquiBind, however, only outputs approximate coordinates for the docked ligand and then aligns the torsion angles of an RDKit conformer to the predicted coordinates, in order to avoid an incorrect docked ligand conformer (wrong bond lengths and bond angles).

So why not predict the torsion angles directly from the EquiBind model? Is that a possible strategy?

Thanks. It is a wonderful job.
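
For reference, the torsion (dihedral) angle that this question is about can be computed from the coordinates of four consecutively bonded atoms. The sketch below uses the standard atan2 formulation; it is a textbook formula written with hypothetical helper names, not code from EquiBind or GeoMol.

```python
import math
from typing import Sequence, Tuple

Vec = Tuple[float, float, float]


def _sub(a: Sequence[float], b: Sequence[float]) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3) -> float:
    """Signed dihedral angle in degrees around the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n2 = _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(_cross(b1, b2), n2)
    return math.degrees(math.atan2(y, x))


# Toy geometries: the fourth atom is rotated about the p1-p2 axis.
gauche = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(gauche, cis, abs(trans))
```

Applying this to every rotatable bond of the RDKit conformer and of the predicted coordinates gives the two sets of angles that the alignment step described above has to reconcile.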
