luwei0917 / tankbind

Open source code for TankBind. Galixir Technologies

License: MIT License

Python 100.00%

tankbind's Introduction

Source code for the NeurIPS 2022 paper TANKBind: Trigonometry-Aware Neural NetworKs for Drug-Protein Binding Structure Prediction

TankBind

TankBind predicts both the protein-ligand binding structure and the binding affinity.

The primary purpose of this repository is to enable the reproduction of the results reported in the paper, as well as to facilitate the work of others who wish to build upon it. To experience the latest version, which includes various improvements made to the model, simply create an account at https://m1.galixir.com/public/login_en/index.html.

If you have any questions or suggestions, please feel free to open an issue or email me at [email protected] or Shuangjia Zheng at [email protected].

Installation

conda create -n tankbind_py38 python=3.8
conda activate tankbind_py38

You might want to change the cudatoolkit version based on the GPU you are using:

conda install pytorch cudatoolkit=11.3 -c pytorch

conda install torchdrug=0.1.2 pyg=2.1.0 biopython nglview jupyterlab -c milagraph -c conda-forge -c pytorch -c pyg
pip install torchmetrics tqdm mlcrate pyarrow
rdkit version used: 2021.03.4

p2rank v2.3 can be downloaded from:

https://github.com/rdk/p2rank/releases/download/2.3/p2rank_2.3.tar.gz
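When debugging an environment, it can help to compare what is actually installed against the versions pinned in the commands above. The following is a minimal sketch, not part of the repository. The distribution names used as dictionary keys are assumptions (conda packages do not always register metadata under the name used on the command line), so a "missing" result only means that no metadata was found under that name.

```python
from importlib import metadata
from typing import Callable, Dict, Optional

# Versions pinned in the installation commands above.
# The distribution names (the keys) are assumptions, not taken from the README.
PINNED = {"torchdrug": "0.1.2", "torch_geometric": "2.1.0"}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if not found."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(
    pinned: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, str]:
    """Map each package name to 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found is None:
            report[name] = "missing"
        elif found.split("+")[0] == wanted:
            # Ignore a local build suffix such as "+cu113".
            report[name] = "ok"
        else:
            report[name] = f"mismatch (found {found})"
    return report


if __name__ == "__main__":
    for package, status in compare_versions(PINNED).items():
        print(f"{package}: {status}")
```

The lookup function is injectable so that the comparison logic can be exercised without touching the real environment.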

Test set evaluation

We include the script for reproducing the self-dock results in:

examples/testset_evaluation_cleaned.ipynb

The test_dataset is constructed using the notebook in the "Dataset construction" section.

Prediction

We use the prediction of the structure of protein ABL1 in complex with two drugs, Imatinib and compound6 (PDB: 6HD6) as an example for predicting the drug-protein binding structure.

examples/prediction_example_using_PDB_6hd6.ipynb

Dataset construction

Scripts for training/test dataset construction are provided in:

examples/construction_PDBbind_training_and_test_dataset.ipynb.ipynb

The script I used to train the model is:

python main.py -d 0 -m 0 --batch_size 5 --label baseline --addNoise 5 --use_equivalent_native_y_mask

High-throughput virtual screening

TankBind also supports virtual screening. In our example here, for the WDR domain of the LRRK2 protein, we can screen 10,000 drug candidates in 2 minutes (or 1M in around 3 hours) with a single GPU. Check out:

examples/high_throughput_virtual_screening_LRRK2_WDR.ipynb
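The two throughput figures quoted above are consistent with each other if screening time scales linearly with the number of compounds. A short worked check, assuming linear scaling (the function name is hypothetical and only for illustration):

```python
def screening_minutes(n_compounds: int, rate_per_minute: float = 10_000 / 2) -> float:
    """Estimated screening time in minutes, assuming linear scaling.

    The default rate is derived from the figure quoted above:
    10,000 compounds in 2 minutes on a single GPU.
    """
    return n_compounds / rate_per_minute


hours_for_one_million = screening_minutes(1_000_000) / 60
print(f"{hours_for_one_million:.2f} hours")  # 200 minutes, i.e. a little over 3 hours
```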

Citation

@article{lu2022tankbind,
	title={Tankbind: Trigonometry-aware neural networks for drug-protein binding structure prediction},
	author={Lu, Wei and Wu, Qifeng and Zhang, Jixian and Rao, Jiahua and Li, Chengtao and Zheng, Shuangjia},
	journal={Advances in Neural Information Processing Systems},
	year={2022}
}


tankbind's Issues

Question about the dataset split

Hi, it's really a great work! But I'm a little confused about the PDBbindv2020 dataset split.
In the arxiv paper you wrote

We followed the same time split as defined in EquiBind paper ........we had 17787 structures for training, 968 for validation and 363 for testing

but the EquiBind paper wrote

From the remaining complexes that are older than 2019, we remove those with ligands contained in the test set, giving 17 347 complexes for training and validation. These are divided into 968 validation complexes, which share no ligands with the remaining 16 379 train complexes

So, compared with the EquiBind training set, your training set contains 1,408 more samples because the complexes whose ligands also appear in the test set were not removed?

If that's the case, I'm concerned this may lead to data leakage...
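The counts quoted in this issue can be checked directly, and the filtering step described in the EquiBind quote can be sketched in a few lines. This is an illustration only: the function name and the idea of keying each complex by a single ligand identifier string are assumptions, not code from either repository.

```python
from typing import Dict, List

# Counts quoted in the two papers above.
TANKBIND_TRAIN = 17_787
EQUIBIND_TRAIN = 16_379
EQUIBIND_VALID = 968
EQUIBIND_TRAIN_PLUS_VALID = 17_347

# The EquiBind numbers are internally consistent, and the gap is 1,408.
assert EQUIBIND_TRAIN + EQUIBIND_VALID == EQUIBIND_TRAIN_PLUS_VALID
print(TANKBIND_TRAIN - EQUIBIND_TRAIN)  # 1408


def remove_shared_ligands(train: Dict[str, str], test: Dict[str, str]) -> List[str]:
    """Return training complex IDs whose ligand does not occur in the test set.

    Both arguments map a complex ID to a ligand identifier (hypothetical format).
    """
    test_ligands = set(test.values())
    return [complex_id for complex_id, ligand in train.items() if ligand not in test_ligands]


# Toy data with made-up IDs: "c2" shares its ligand with a test complex.
print(remove_shared_ligands({"c1": "ligA", "c2": "ligB"}, {"t1": "ligB"}))  # ['c1']
```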

How to segment the protein with functional blocks?

Hi,
Thank you for the amazing work!
I am curious about how to segment the protein and how to choose the native block for a protein-ligand pair.

As mentioned in Sec 3.4, "We, therefore, take full use of information stored in the whole protein instead of only the native binding region. " How to ensure the blocks (constructed with p2rank centers) cover the whole protein?
As mentioned in Appendix J., "A protein block encloses the ligand when it covers more than 90% of the native interaction." And since a functional block can cover about 200 amino acids. I think the functional blocks would be constructed with some overlaps, right? And thus, there would be multiple blocks that satisfy the above-mentioned condition. Curious about how you deal with such a situation.

Thank you so much!
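To make the second question concrete, the 90% criterion quoted from Appendix J can be sketched as a coverage test over candidate block centers. Everything here is an assumption made for illustration: native interactions are represented as 3D points, a block is a sphere around its center, and the 20 Å radius is only borrowed from the radius mentioned elsewhere on this page. It is not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def coverage(center: Point, radius: float, contacts: Sequence[Point]) -> float:
    """Fraction of native-contact points lying within `radius` of `center`."""
    if not contacts:
        return 0.0
    inside = sum(1 for point in contacts if math.dist(center, point) <= radius)
    return inside / len(contacts)


def enclosing_blocks(
    centers: Sequence[Point],
    contacts: Sequence[Point],
    radius: float = 20.0,    # assumed block radius in angstroms
    threshold: float = 0.9,  # "more than 90% of the native interaction"
) -> List[int]:
    """Indices of all blocks whose coverage exceeds the threshold."""
    return [i for i, c in enumerate(centers) if coverage(c, radius, contacts) > threshold]


# Ten contact points along a line; two overlapping blocks and one distant block.
contacts = [(float(i), 0.0, 0.0) for i in range(10)]
centers = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
print(enclosing_blocks(centers, contacts))  # [0, 1]
```

With overlapping blocks, more than one index can be returned, which is exactly the situation the question asks about.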

M1 Mac optimisation

Hello and thanks for the code.

Before I get into it, I was wondering if the tool has been tested on M1 machines and if it is optimised for them.
I know that CUDA GPU compatibility is complicated with M1 macs and I've noticed in the installation instructions that you need cudatoolkit.

Many thanks

A problem when building construction dataset

This error was encountered while building the dataset, and I looked it up on the internet with no good results.

I found problems with the following ligand. At first I wanted to delete the content, but there's too much of it.

@luwei0917

Reproduction of Model Training

Dear Wei

Could you please provide guidelines for reproducing model training in the Readme?

Thanks.

Issue of affinity evaluation

Hi, I'm trying to reproduce the affinity experiment. It seems there are two checkpoints in the repo, self_dock.pt and re_dock.pt. I am wondering which checkpoint should be used to reproduce the results.

How to predict coord and affinity?

Where in the code or the paper is it explained how to predict the coordinates and the affinity?
In Figure 1 of the model, what is the meaning of "prioritization"?
I can't find it in the code or the paper.
Thanks.

Matrix dimension error when I am trying to inference

Is the ligand RMSD computed without hydrogen

Thanks for your promising contribution! I have a small question from reading the code. I found that the read_mol method removes the hydrogen atoms and that the resulting mol object is used in later processing. Is the ligand RMSD also computed on these conformers (without hydrogens) during training?

Looking forward to your reply!

Sincerely,
Lin
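For reference, an RMSD restricted to heavy atoms can be sketched as below. This is a generic illustration and not the repository's code: it assumes the predicted and reference coordinates share the same atom order, and it performs no alignment and no symmetry correction.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def heavy_atom_rmsd(symbols: Sequence[str], pred: Sequence[Point], ref: Sequence[Point]) -> float:
    """RMSD over non-hydrogen atoms, given matching atom order in both poses."""
    if not (len(symbols) == len(pred) == len(ref)):
        raise ValueError("symbols, pred and ref must have the same length")
    heavy = [i for i, symbol in enumerate(symbols) if symbol != "H"]
    if not heavy:
        raise ValueError("no heavy atoms found")
    squared = sum(math.dist(pred[i], ref[i]) ** 2 for i in heavy)
    return math.sqrt(squared / len(heavy))


# The hydrogen is far from its reference position but does not affect the result.
symbols = ["C", "O", "H"]
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
ref = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 0.0, 0.0)]
print(heavy_atom_rmsd(symbols, pred, ref))  # 1.0
```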

The paper description and link point to bioRxiv, not arXiv

Congratulations on the fantastic job! The links point to bioRxiv, not arXiv. Just a tiny mislink.

How to solve the matrix mismatch in testset_evaluation_cleaned.ipynb

RuntimeError: mat1 and mat2 shapes cannot be multiplied (12x18 and 19x56)
Why did this problem occur?
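The message itself says that the inner dimensions disagree: each input row carries 18 features, while the layer's weight matrix expects 19. A traceback in a later issue on this page locates the failing call in self.conv_compound in model.py. What produces the 18-dimensional features is not stated here. The sketch below only parses such a message and spells out the mismatch; the helper names are hypothetical.

```python
import re
from typing import Tuple

Shape = Tuple[int, int]

# Matches the "(AxB and CxD)" part of the PyTorch matmul error message.
_SHAPES = re.compile(r"\((\d+)x(\d+) and (\d+)x(\d+)\)")


def parse_shapes(message: str) -> Tuple[Shape, Shape]:
    """Extract the two matrix shapes from a 'cannot be multiplied' message."""
    match = _SHAPES.search(message)
    if match is None:
        raise ValueError("no shape pair found in message")
    a, b, c, d = (int(group) for group in match.groups())
    return (a, b), (c, d)


def explain(message: str) -> str:
    """Describe whether, and why, the two shapes cannot be multiplied."""
    (rows, in_features), (expected, out_features) = parse_shapes(message)
    if in_features == expected:
        return f"shapes are compatible: result would be {rows}x{out_features}"
    return (
        f"input has {in_features} features per row, "
        f"but the layer expects {expected}"
    )


print(explain("RuntimeError: mat1 and mat2 shapes cannot be multiplied (12x18 and 19x56)"))
```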

split_protein_and_ligand

Hi Wei,

I ran the example prediction code and got an error when using 'split_protein_and_ligand'. I wonder if it's a network problem from JupyterLab or some other problem.

Thanks!

Question about Evaluation Script

Thanks for your awesome work!

I tried to reproduce the results reported in the paper using the checkpoint self-dock.pt, but the results seem different. Even in a simpler setting where I already know the true binding site (a 20 Å radius around the true compound center), the results differ.

I follow the evaluation paradigm: I first predict the distance map and then use it for 5,000 epochs of gradient descent. But in my results only 20% of the test data fall within the 5 Å threshold. Did I make a mistake?

I would appreciate it very much if you could provide the evaluation script!

ImportError: No module named 'torch_ext'

Can you tell me how to solve this problem?

Strange molecular pose output

I tried to use TankBind to dock the H1 histamine receptor (https://www.rcsb.org/structure/3RZE) with a few H1 binders (like diphenhydramine) listed below:
Doxepin: https://pubchem.ncbi.nlm.nih.gov/compound/Doxepin
Cyproheptadine: https://pubchem.ncbi.nlm.nih.gov/compound/2913
Loratadine: https://pubchem.ncbi.nlm.nih.gov/compound/3957

But the results show very strange poses: the twisted molecule is completely deformed, which can almost never happen in reality.

@luwei0917 Do you have any ideas to solve this problem?

Waiting for the training script

Thanks for providing the implementation of such wonderful work!
Can't wait to see the full training scripts as other Forkers do!

Requesting list of unseen receptors and scripts for filtering

Thanks for providing the implementation of such wonderful work!

I have noticed that the number of unseen receptors differs from EquiBind (142 v.s. 144). I think that the discrepancy in quantity arises from different training sets. And I'm seeking your assistance in obtaining a specific list of these unseen receptors. Additionally, if any scripts are available to filter these receptors, it would be immensely helpful.

Thank you in advance for your support!

The exact number of PDBBind training set

Hi Wei!
I'm trying to use construction_PDBbind_training_and_test_dataset.ipynb to process the PDBbind dataset manually, but I found several inconsistencies that confuse me.

The size of the training set reported in the newest version of your paper is 17,787, but the size of the training set output by construction_PDBbind_training_and_test_dataset.ipynb is 17,786.
I ran the Jupyter notebook locally and the number of ligand files readable by RDKit is 19,128 (the final processed training set then has 17,795 entries), but the same cell in your original notebook outputs 19,119. Is this caused by the RDKit version (I used 2022.03.5, installed through pip) or something else?

Installation problems: libtorch_cuda_cu.so not found

We installed miniconda (user-level install), then activated a new conda environment for tankbind, and we installed TankBind following the instructions to the letter. We found the following error:

(tankbind) rodriguezg@darwin:~/code/TankBind/examples$ python -V
Python 3.8.13

(tankbind) rodriguezg@darwin:~/code/TankBind/examples$ python virtual_screening_test_tankbind.py 
Traceback (most recent call last):
  File "virtual_screening_test_tankbind.py", line 11, in <module>
    from feature_utils import get_protein_feature
  File "/data/home/rodriguezg/code/TankBind/examples/../tankbind/feature_utils.py", line 21, in <module>
    from torchdrug import data as td     # conda install torchdrug -c milagraph -c conda-forge -c pytorch -c pyg if fail to import
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torchdrug/__init__.py", line 1, in <module>
    from . import patch
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torchdrug/patch.py", line 13, in <module>
    from torchdrug import core, data
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torchdrug/core/__init__.py", line 2, in <module>
    from .engine import Engine
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torchdrug/core/engine.py", line 10, in <module>
    from torchdrug import data, core, utils
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torchdrug/data/__init__.py", line 1, in <module>
    from .dictionary import PerfectHash, Dictionary
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torchdrug/data/dictionary.py", line 4, in <module>
    from torch_scatter import scatter_max
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torch_scatter/__init__.py", line 16, in <module>
    torch.ops.load_library(spec.origin)
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/site-packages/torch/_ops.py", line 255, in load_library
    ctypes.CDLL(path)
  File "/home/rodriguezg/miniconda3/envs/tankbind/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
(tankbind) rodriguezg@darwin:~/code/TankBind/examples$

This library libtorch_cuda_cu.so is nowhere to be found. Any ideas?

For what it's worth, the host has NVIDIA kernel driver version 510.85.02 (I'm not sure whether this influences conda to install different versions of the CUDA libraries).
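One way to narrow this down is to check whether the installed torch package ships the library that torch_scatter is trying to load. The sketch below searches the torch package directory without importing torch. It is a diagnostic aid only. If the file is absent, a plausible reading (an assumption, not confirmed in this thread) is that torch_scatter was built against a different PyTorch/CUDA build than the one installed.

```python
import importlib.util
from pathlib import Path
from typing import List, Optional


def find_library(root: Path, filename: str) -> List[Path]:
    """Recursively collect files named `filename` under `root`."""
    return sorted(path for path in root.rglob(filename) if path.is_file())


def torch_package_dir() -> Optional[Path]:
    """Locate the installed torch package directory without importing torch."""
    spec = importlib.util.find_spec("torch")
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


if __name__ == "__main__":
    root = torch_package_dir()
    if root is None:
        print("torch is not installed in this environment")
    else:
        hits = find_library(root, "libtorch_cuda_cu.so")
        print(hits if hits else f"libtorch_cuda_cu.so not found under {root}")
```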

run model(data) error

apr22_pdbbind_gvp_pocket_radius20

Hi,
The function get_data from data.py tries to create the dataset from the folder apr22_pdbbind_gvp_pocket_radius20, which is not created by the notebook construction_PDBbind_training_and_testing.ipynb. Can you please tell me how to fix it?
Thank you.

RuntimeError: mat1 and mat2 shapes cannot be multiplied (123x18 and 19x56)

When I follow the README and run "high_throughput_virtual_screening_LRRK2_WDR.ipynb" and "prediction_example_using_PDB_6hd6.ipynb", I get the following error:

RuntimeError Traceback (most recent call last)
File ~/anaconda3/envs/tankbind_py38/lib/python3.8/site-packages/torch/nn/modules/module.py:1110, in Module._call_impl(self, *input, **kwargs)
1106 # If we don't have any hooks, we want to skip the rest of the logic in
1107 # this function, and just call forward.
1108 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1109 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110 return forward_call(*input, **kwargs)
1111 # Do not call functions when jit is used
1112 full_backward_hooks, non_full_backward_hooks = [], []

File ~/TankBind-main-new/TankBind-main/examples/../tankbind/model.py:343, in IaBNet_with_affinity.forward(self, data)
341 edge_weight = data[("compound", "c2c", "compound")].edge_weight
342 compound_batch = data['compound'].batch
--> 343 compound_out = self.conv_compound(compound_edge_index,edge_weight,compound_edge_feature,compound_x.shape[0],compound_x)['node_feature']
345 # protein_batch version could further process b matrix. better than for loop.
346 # protein_out_batched of shape b, n, c
...
File ~/anaconda3/envs/tankbind_py38/lib/python3.8/site-packages/torch/nn/modules/linear.py:103, in Linear.forward(self, input)
102 def forward(self, input: Tensor) -> Tensor:
--> 103 return F.linear(input, self.weight, self.bias)

RuntimeError: mat1 and mat2 shapes cannot be multiplied (123x18 and 19x56)

It seems that the example data does not match the GVP size. I'd appreciate it if you could help me fix it.

Choice of binding site or whole protein docking?

Hi,

I was wondering if it was possible to select a specific binding site with this software or if it is limited to whole protein docking. Thanks for your insight.

Cheers,

Tony

RuntimeError: mat1 and mat2 shapes cannot be multiplied (185x18 and 19x56)

When I run the following code in examples/prediction_example_using_PDB_6hd6.ipynb and examples/high_throughput_virtual_screening_LRRK2_WDR.ipynb:

for data in tqdm(data_loader):
    data = data.to(device)
    y_pred, affinity_pred = model(data)
    affinity_pred_list.append(affinity_pred.detach().cpu())
    if False:
        # we don't need to save the predicted distance map in HTVS setting.
        for i in range(data.y_batch.max() + 1):
            y_pred_list.append((y_pred[data['y_batch'] == i]).detach().cpu())

it returns an error:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (185x18 and 19x56)

Where should p2rank be located?

Question on p2rank to segment the protein:

bash: /packages/p2rank_2.3/prank: No such file or directory

Where should p2rank be located?
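The bash message means that no file exists at the path the notebook was given. Before running the notebook, the path can be verified with a short check like the one below. This is a sketch: the directory value is a hypothetical placeholder for wherever the p2rank_2.3 archive was extracted, and the check only assumes that the prank launcher sits at the top level of that directory, as in the error message above.

```python
import os
from pathlib import Path


def check_prank(p2rank_dir: str) -> str:
    """Return 'ok', 'not executable', or 'not found: <path>' for the prank launcher."""
    prank = Path(p2rank_dir).expanduser() / "prank"
    if not prank.is_file():
        return f"not found: {prank}"
    if not os.access(prank, os.X_OK):
        return "not executable"
    return "ok"


if __name__ == "__main__":
    # Hypothetical location; replace with the directory where p2rank was extracted.
    print(check_prank("~/p2rank_2.3"))
```

If the result is "ok", the same directory is the one to use in the notebook's p2rank path.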

The following problem arises with p2rank in construction_PDBbind_training_and_test_dataset

The command runs in the terminal but not in Jupyter. The following error occurs:
/TankBind/p2rank_2: syntax error: operand expected (error token is "/share/home//TankBind/p2rank_2")

RuntimeError: mat1 and mat2 shapes cannot be multiplied (123x18 and 19x56)

Hi Wei,

I ran the example prediction code and got an error at 'affinity_pred_list = torch.cat(affinity_pred_list)'. I don't know what caused the error or how to fix it.

Thanks!
