
tankbind's Introduction


Source code for the NeurIPS 2022 paper TANKBind: Trigonometry-Aware Neural NetworKs for Drug-Protein Binding Structure Prediction

TankBind

TankBind can predict both the protein-ligand binding structure and the binding affinity.

The primary purpose of this repository is to enable the reproduction of the results reported in the paper, as well as to facilitate the work of others who wish to build upon it. To experience the latest version, which includes various improvements made to the model, simply create an account at https://m1.galixir.com/public/login_en/index.html.

If you have any questions or suggestions, please feel free to open an issue or email me at [email protected] or Shuangjia Zheng at [email protected].

Installation

conda create -n tankbind_py38 python=3.8
conda activate tankbind_py38

You might want to change the cudatoolkit version based on the GPU you are using:

conda install pytorch cudatoolkit=11.3 -c pytorch
conda install torchdrug=0.1.2 pyg=2.1.0 biopython nglview jupyterlab -c milagraph -c conda-forge -c pytorch -c pyg
pip install torchmetrics tqdm mlcrate pyarrow
rdkit version used: 2021.03.4
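
After installation, a quick check that the packages are visible to Python can save time. The following is a minimal sketch, not part of the repository: the helper name `missing_packages` is hypothetical, and the import names (for example `Bio` for biopython and `torch_geometric` for pyg) are the usual top-level module names, assumed here. It only checks that each package can be located; it does not load compiled libraries, so it will not detect CUDA library mismatches.

```python
import importlib.util


def missing_packages(names):
    """Return the names that the import system cannot locate."""
    return [n for n in names if importlib.util.find_spec(n) is None]


# Assumed top-level import names for the packages installed above.
REQUIRED = [
    "torch", "torchdrug", "torch_geometric", "Bio", "rdkit",
    "torchmetrics", "tqdm", "mlcrate", "pyarrow",
]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```

Run it inside the activated tankbind_py38 environment; any name it lists still needs to be installed.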

p2rank v2.3 can be downloaded from here:

https://github.com/rdk/p2rank/releases/download/2.3/p2rank_2.3.tar.gz

Test set evaluation

We include the script for reproducing the self-dock result in

examples/testset_evaluation_cleaned.ipynb

The test_dataset is constructed using the notebook in the "Dataset construction" section.

Prediction

As an example of predicting a drug-protein binding structure, we predict the structure of the protein ABL1 in complex with two drugs, Imatinib and compound6 (PDB: 6HD6):

examples/prediction_example_using_PDB_6hd6.ipynb

Dataset construction

Scripts for training/test dataset construction are provided in:

examples/construction_PDBbind_training_and_test_dataset.ipynb

The script I used to train the model is:

python main.py -d 0 -m 0 --batch_size 5 --label baseline --addNoise 5 --use_equivalent_native_y_mask

High-throughput virtual screening

TankBind also supports virtual screening. In our example here, for the WDR domain of the LRRK2 protein, we can screen 10,000 drug candidates in 2 minutes (or 1M in around 3 hours) with a single GPU. See:

examples/high_throughput_virtual_screening_LRRK2_WDR.ipynb
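
The two throughput figures quoted above are consistent under a simple linear extrapolation. The sketch below is only that arithmetic; the function name `estimated_hours` is hypothetical, and it assumes the screening time scales linearly with the number of compounds, which real runs may not do exactly.

```python
def estimated_hours(n_compounds, ref_compounds=10_000, ref_minutes=2.0):
    """Linearly extrapolate screening time from the quoted reference rate."""
    minutes = n_compounds / ref_compounds * ref_minutes
    return minutes / 60.0


# 1,000,000 / 10,000 * 2 min = 200 min, i.e. a little over 3 hours.
print(round(estimated_hours(1_000_000), 2))  # prints 3.33
```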

Citation

@article{lu2022tankbind,
	title={Tankbind: Trigonometry-aware neural networks for drug-protein binding structure prediction},
	author={Lu, Wei and Wu, Qifeng and Zhang, Jixian and Rao, Jiahua and Li, Chengtao and Zheng, Shuangjia},
	journal={Advances in Neural Information Processing Systems},
	year={2022}
}

tankbind's People

Contributors

luwei0917, prokia


tankbind's Issues

apr22_pdbbind_gvp_pocket_radius20

Hi,
The function get_data from data.py tries to create the dataset from the folder apr22_pdbbind_gvp_pocket_radius20, which is not created by the notebook construction_PDBbind_training_and_testing.ipynb. Can you please tell me how to fix it?
Thank you.

Strange molecular pose output

I tried to use TankBind to dock the H1 histamine receptor (https://www.rcsb.org/structure/3RZE) with a few molecules, listed below.
H1 binders (like diphenhydramine):
Doxepin: https://pubchem.ncbi.nlm.nih.gov/compound/Doxepin
Cyproheptadine: https://pubchem.ncbi.nlm.nih.gov/compound/2913
Loratadine: https://pubchem.ncbi.nlm.nih.gov/compound/3957

But the results show the molecules in very strange poses that could hardly occur in reality: the molecule is twisted and completely deformed.
(screenshot of the predicted pose)
@luwei0917 Do you have any ideas to solve this problem?

Requesting list of unseen receptors and scripts for filtering

Thanks for providing the implementation of such wonderful work!

I have noticed that the number of unseen receptors differs from EquiBind (142 vs. 144). I think the discrepancy arises from the different training sets. Could you share the specific list of these unseen receptors? If any scripts for filtering these receptors are available, that would also be immensely helpful.

Thank you in advance for your support!

self_dock.pt vs. re_dock.pt

What is the difference between the two saved models (self_dock.pt and re_dock.pt)?
Is there one that should be used over the other?

split_protein_and_ligand

Hi Wei,

I ran the example prediction code and got an error when using 'split_protein_and_ligand'. I wonder whether it's a network problem in JupyterLab or something else.

Thanks!


Waiting for the training script

Thanks for providing the implementation of such wonderful work!
Can't wait to see the full training scripts as other Forkers do!

Installation problems: libtorch_cuda_cu.so not found

We installed miniconda (user-level install), then activated a new conda environment for tankbind, and we installed TankBind following the instructions to the letter. We found the following error:

(tankbind) rodriguezg@darwin:~/code/TankBind/examples$ python -V
Python 3.8.13

(tankbind) rodriguezg@darwin:~/code/TankBind/examples$ python virtual_screening_test_tankbind.py 
Traceback (most recent call last):
  File "virtual_screening_test_tankbind.py", line 11, in <module>
    from feature_utils import get_protein_feature
  File "/data/home/rodriguezg/code/TankBind/examples/../tankbind/feature_utils.py", line 21, in <module>
    from torchdrug import data as td     # conda install torchdrug -c milagraph -c conda-forge -c pytorch -c pyg if fail to import
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torchdrug/__init__.py", line 1, in <module>
    from . import patch
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torchdrug/patch.py", line 13, in <module>
    from torchdrug import core, data
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torchdrug/core/__init__.py", line 2, in <module>
    from .engine import Engine
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torchdrug/core/engine.py", line 10, in <module>
    from torchdrug import data, core, utils
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torchdrug/data/__init__.py", line 1, in <module>
    from .dictionary import PerfectHash, Dictionary
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torchdrug/data/dictionary.py", line 4, in <module>
    from torch_scatter import scatter_max
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torch_scatter/__init__.py", line 16, in <module>
    torch.ops.load_library(spec.origin)
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torch/_ops.py", line 255, in load_library
    ctypes.CDLL(path)
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
(tankbind) rodriguezg@darwin:~/code/TankBind/examples$ 

This library libtorch_cuda_cu.so is nowhere to be found. Any ideas?

For what it's worth, the host has NVIDIA kernel driver version 510.85.02 (I'm not sure whether this influences which versions of the CUDA libraries conda installs).

How to segment the protein with functional blocks?

Hi,
Thank you for the amazing work!
I am curious about how to segment the protein and how to choose the native block for a protein-ligand pair.

  1. As mentioned in Sec 3.4, "We, therefore, take full use of information stored in the whole protein instead of only the native binding region." How do you ensure that the blocks (constructed with p2rank centers) cover the whole protein?

  2. As mentioned in Appendix J, "A protein block encloses the ligand when it covers more than 90% of the native interaction." Since a functional block can cover about 200 amino acids, I think the functional blocks would be constructed with some overlap, right? If so, multiple blocks could satisfy the above condition. I'm curious how you deal with that situation.

Thank you so much!
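
To make the second question concrete, here is a minimal sketch of the "more than 90% of the native interaction" criterion. It is an illustration under stated assumptions, not the authors' code: the native interaction is represented as a set of residue indices in contact with the ligand, each block as a set of residue indices, and the names `coverage` and `enclosing_blocks` are hypothetical.

```python
def coverage(block_residues, native_contacts):
    """Fraction of native-contact residues that fall inside the block."""
    native = set(native_contacts)
    if not native:
        return 0.0
    return len(native & set(block_residues)) / len(native)


def enclosing_blocks(blocks, native_contacts, threshold=0.9):
    """Ids of all blocks covering more than `threshold` of the native contacts."""
    return [bid for bid, residues in blocks.items()
            if coverage(residues, native_contacts) > threshold]


# Toy example: overlapping blocks of roughly 200 residues each.
native = range(100, 120)          # 20 contact residues (assumed)
blocks = {
    "a": range(0, 150),           # covers 20/20 contacts
    "b": range(101, 300),         # covers 19/20 contacts
    "c": range(110, 300),         # covers 10/20 contacts
}
print(enclosing_blocks(blocks, native))  # prints ['a', 'b']
```

In this toy case two overlapping blocks both pass the threshold, which is exactly the situation the question asks about.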

Is the ligand RMSD computed without hydrogen

Thanks for your promising contribution! I have a small question from reading the code. I found that the read_mol method removes the hydrogen atoms, and the resulting mol object is used for later processing. Is the ligand RMSD also computed on these hydrogen-free conformers during training?

Looking forward to your reply!

Sincerely,
Lin
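
For reference, a heavy-atom RMSD of the kind asked about can be sketched as follows. This is an illustration only, not the repository's implementation: the function name `heavy_atom_rmsd` is hypothetical, both poses are assumed to list atoms in the same order, and no alignment or symmetry correction is applied.

```python
import math


def heavy_atom_rmsd(elements, coords_a, coords_b):
    """RMSD between two poses over non-hydrogen atoms only."""
    pairs = [(a, b) for e, a, b in zip(elements, coords_a, coords_b) if e != "H"]
    if not pairs:
        raise ValueError("no heavy atoms to compare")
    total = sum(sum((p - q) ** 2 for p, q in zip(a, b)) for a, b in pairs)
    return math.sqrt(total / len(pairs))


elements = ["C", "O", "H"]
pose_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
pose_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 0.0, 0.0)]
# The hydrogen is far off in pose_b but is ignored; both heavy atoms moved by 1.
print(heavy_atom_rmsd(elements, pose_a, pose_b))  # prints 1.0
```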

how to predict coord and affinity?

  1. Where in the code or paper is it explained how the coordinates and affinity are predicted?
  2. In Figure 1 of the paper, what does "prioritization" mean?
    I can't find it in the code or paper.
    Thanks.

A problem when constructing the dataset

I encountered this error while building the dataset. I looked it up on the internet but found nothing useful.
(screenshot of the error)
I found problems with the following ligand. At first I wanted to delete the content, but there is too much of it.
(screenshot of the ligand)

@luwei0917

The exact number of PDBBind training set

Hi Wei!
I'm trying to use construction_PDBbind_training_and_test_dataset.ipynb to process the PDBbind dataset manually, but I found several inconsistencies that confuse me.

  1. The training set size reported in the newest version of your paper is 17,787, but the training set size output by construction_PDBbind_training_and_test_dataset.ipynb is 17,786.
  2. When I run the notebook locally, the number of ligand files readable by RDKit is 19,128 (and the final processed training set has 17,795 entries), but the same cell in your original notebook outputs 19,119. Is this caused by the RDKit version (I used 2022.03.5, installed through pip) or something else?

Issue of affinity evaluation

Hi, I'm trying to reproduce the affinity experiment. It seems there are two checkpoints in the repo, self_dock.pt and re_dock.pt. I'm wondering which checkpoint should be used to reproduce the results.

Question about the dataset split

Hi, it's really great work! But I'm a little confused about the PDBbind v2020 dataset split.
In the arXiv paper you wrote:

We followed the same time split as defined in EquiBind paper ........we had 17787 structures for training, 968 for validation and 363 for testing

but the EquiBind paper wrote:

From the remaining complexes that are older than 2019, we remove those with ligands contained in the test set, giving 17 347 complexes for training and validation. These are divided into 968 validation complexes, which share no ligands with the remaining 16 379 train complexes

So, compared with the EquiBind training set, does your training set contain 1,408 more samples because complexes whose ligands appear in the test set were not removed?

If that's the case, I'm concerned this may lead to data leakage...
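
The figure of 1,408 follows directly from the numbers in the two quotations above. The short check below uses only those quoted values; the variable names are chosen for illustration.

```python
# Numbers quoted from the EquiBind paper.
equibind_train_plus_val = 17_347
equibind_val = 968
equibind_train = equibind_train_plus_val - equibind_val

# Number quoted from the TankBind arXiv paper.
tankbind_train = 17_787

print(equibind_train)                   # prints 16379
print(tankbind_train - equibind_train)  # prints 1408
```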

RuntimeError: mat1 and mat2 shapes cannot be multiplied (185x18 and 19x56)

When I run the following code in examples/prediction_example_using_PDB_6hd6.ipynb and examples/high_throughput_virtual_screening_LRRK2_WDR.ipynb:

for data in tqdm(data_loader):
    data = data.to(device)
    y_pred, affinity_pred = model(data)
    affinity_pred_list.append(affinity_pred.detach().cpu())
    if False:
        # we don't need to save the predicted distance map in HTVS setting.
        for i in range(data.y_batch.max() + 1):
            y_pred_list.append((y_pred[data['y_batch'] == i]).detach().cpu())

it returns an error:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (185x18 and 19x56)
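
The message itself can be read with the usual matrix-product rule: the number of columns of the first matrix must equal the number of rows of the second. Here the input has 18 feature columns while the layer weight expects 19. The sketch below only illustrates that rule in plain Python; `matmul_shape` is a hypothetical helper, and it does not identify why the featurised input ends up with 18 columns.

```python
def matmul_shape(a_shape, b_shape):
    """Result shape of a 2-D matrix product, or ValueError if the inner sizes differ."""
    (n, k_a), (k_b, m) = a_shape, b_shape
    if k_a != k_b:
        raise ValueError(f"inner dimensions differ: {k_a} != {k_b}")
    return (n, m)


print(matmul_shape((185, 19), (19, 56)))  # prints (185, 56)

try:
    matmul_shape((185, 18), (19, 56))     # the shapes from the error message
except ValueError as err:
    print(err)                            # prints inner dimensions differ: 18 != 19
```

Printing the shape of the compound node feature tensor just before the failing call is one way to confirm which side of the product is off by one.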

RuntimeError: mat1 and mat2 shapes cannot be multiplied (123x18 and 19x56)

When I follow the README and run "high_throughput_virtual_screening_LRRK2_WDR.ipynb" and "prediction_example_using_PDB_6hd6.ipynb", I get the following error:

RuntimeError Traceback (most recent call last)
File ~/anaconda3/envs/tankbind_py38/lib/python3.8/site-packages/torch/nn/modules/module.py:1110, in Module._call_impl(self, *input, **kwargs)
1106 # If we don't have any hooks, we want to skip the rest of the logic in
1107 # this function, and just call forward.
1108 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1109 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110 return forward_call(*input, **kwargs)
1111 # Do not call functions when jit is used
1112 full_backward_hooks, non_full_backward_hooks = [], []

File ~/TankBind-main-new/TankBind-main/examples/../tankbind/model.py:343, in IaBNet_with_affinity.forward(self, data)
341 edge_weight = data[("compound", "c2c", "compound")].edge_weight
342 compound_batch = data['compound'].batch
--> 343 compound_out = self.conv_compound(compound_edge_index,edge_weight,compound_edge_feature,compound_x.shape[0],compound_x)['node_feature']
345 # protein_batch version could further process b matrix. better than for loop.
346 # protein_out_batched of shape b, n, c
...
File ~/anaconda3/envs/tankbind_py38/lib/python3.8/site-packages/torch/nn/modules/linear.py:103, in Linear.forward(self, input)
102 def forward(self, input: Tensor) -> Tensor:
--> 103 return F.linear(input, self.weight, self.bias)

RuntimeError: mat1 and mat2 shapes cannot be multiplied (123x18 and 19x56)

It seems that the example data does not match the GVP input size. I'd appreciate it if you could help me fix it.

difference of reporting experiment results

There is a difference between the experimental results reported in the paper and the results from the GitHub code.
Are the paper results from TBind v1.0.1 and the GitHub results from TBind v0.5.0?

Where should p2rank be located?

Question on p2rank to segment the protein:

bash: /packages/p2rank_2.3/prank: No such file or directory

Where should p2rank be located?

Question about Evaluation Script

Thanks for your awesome work!

I tried to reproduce the results reported in the paper using the checkpoint self_dock.pt, but the results seem different. Even in a simpler setting where I already know the true binding site (a 20 Å radius around the true compound center), the results differ.

My evaluation procedure is to first predict the distance map and then run 5,000 epochs of gradient descent on it. But only 20% of my test data falls within the 5 Å threshold. Did I make a mistake?

I would appreciate it very much if you could provide the evaluation script!
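
For readers unfamiliar with the procedure described above, the idea of recovering ligand coordinates from a predicted protein-ligand distance map by gradient descent can be sketched as follows. This is a simplified pure-Python illustration, not the repository's evaluation script: the names `loss_and_grad` and `fit_ligand` are hypothetical, protein node positions are held fixed, the loss is a plain mean squared distance error, and any intra-ligand geometry terms are omitted.

```python
import math
import random


def loss_and_grad(protein, ligand, target):
    """Mean squared distance error and its gradient w.r.t. ligand coordinates."""
    n, m = len(protein), len(ligand)
    loss = 0.0
    grad = [[0.0, 0.0, 0.0] for _ in range(m)]
    for i, p in enumerate(protein):
        for j, x in enumerate(ligand):
            diff = [x[k] - p[k] for k in range(3)]
            d = math.sqrt(sum(c * c for c in diff)) + 1e-9  # avoid division by zero
            err = d - target[i][j]
            loss += err * err
            for k in range(3):
                grad[j][k] += 2.0 * err * diff[k] / d
    scale = 1.0 / (n * m)
    return loss * scale, [[g * scale for g in row] for row in grad]


def fit_ligand(protein, target, n_atoms, steps=500, lr=0.5, seed=0):
    """Plain gradient descent from a random start; returns (coords, final loss)."""
    rng = random.Random(seed)
    ligand = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n_atoms)]
    for _ in range(steps):
        _, grad = loss_and_grad(protein, ligand, target)
        ligand = [[x[k] - lr * g[k] for k in range(3)] for x, g in zip(ligand, grad)]
    return ligand, loss_and_grad(protein, ligand, target)[0]


# Toy data: four fixed protein nodes and a two-atom "ligand" (assumed values).
protein = [(5.0, 0.0, 0.0), (0.0, 5.0, 0.0), (0.0, 0.0, 5.0), (-5.0, -5.0, -5.0)]
true_ligand = [(0.5, 0.2, -0.3), (1.5, 0.4, 0.1)]
target = [[math.dist(p, x) for x in true_ligand] for p in protein]

_, initial_loss = fit_ligand(protein, target, n_atoms=2, steps=0)
_, final_loss = fit_ligand(protein, target, n_atoms=2, steps=500)
print(initial_loss, final_loss)
```

With real predictions the target distances are noisy, so the result depends on the initialisation, the step size, and any additional constraints used, which may explain differences between independent reimplementations.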

M1 Mac optimisation

Hello and thanks for the code.

I was wondering, before I get into it, whether the tool has been tested on M1 machines and whether it is optimised for them.
I know that CUDA GPU compatibility is complicated on M1 Macs, and I noticed that the installation instructions require cudatoolkit.

Many thanks
