octavian-ganea / equidock_public
EquiDock: geometric deep learning for fast rigid 3D protein-protein docking
License: MIT License
Hi there,
I'd like to report an installation bug. So far my workaround has been to use dgl==0.9.0 rather than the dgl==0.7.0 pinned in the requirements.
Also, is there an easy way to run the models on a custom set of PDBs? I'd prefer not to modify inference_rigid.py, but it seems there is no way to pass a custom set other than, perhaps, inserting it as test data.
Hi, dear authors of EquiDock. I was very sad to hear that Octavian Ganea passed away before he could fully show his extraordinary talent.
I was going over the Kabsch computation and found the construction of the rotation matrix somewhat confusing. Specifically, U, S, Vt = np.linalg.svd(H) gives us U, S, V^T, which correspond to U2, S, U1^T in the paper. The rotation matrix is then obtained via R = Vt.T @ U.T, which differs from what is described in the text: there, R = U2 @ U1^T, which should be R = U @ Vt in the code. Do you agree?
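For readers hitting the same question, here is a generic NumPy sketch of the Kabsch rotation (a standard textbook implementation, not EquiDock's exact code). Note that which formula is "right" depends purely on whether the cross-covariance is built as P^T Q or Q^T P, i.e. on which SVD factor the paper names U1:

```python
import numpy as np

def kabsch_rotation(P, Q):
    """Rotation R minimizing ||P @ R.T - Q||_F, where rows of P and Q are
    paired 3D points (P is rotated onto Q)."""
    H = P.T @ Q                      # 3x3 cross-covariance, H = sum_i p_i q_i^T
    U, S, Vt = np.linalg.svd(H)      # NumPy returns V transposed (Vt = V^T)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])       # reflection guard: force det(R) = +1
    return Vt.T @ D @ U.T            # R = V D U^T
```

With this convention the answer is R = V U^T; if H is instead built as Q^T P, the same derivation yields U V^T of the swapped factors, which is how the paper's U2 U1^T and the code's Vt.T @ U.T can both be correct under different naming.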
When I run the command as follows:
python preprocess_raw_data.py -n_jobs 60 -data dips -graph_nodes residues -graph_cutoff 30 -graph_max_neighbor 10 -graph_residue_loc_is_alphaC -pocket_cutoff 8 -data_fraction 1.0
it generates six files in the directory /extendplus/jiashan/equidock_public/src/cache/dips_residues_maxneighbor_10_cutoff_30.0_pocketCut_8.0/cv_0:
label_test.pkl ligand_graph_test.bin receptor_graph_test.bin
label_val.pkl ligand_graph_val.bin receptor_graph_val.bin
However, the remaining three (train) files are not generated; the process dies with the following output:
Processing ./cache/dips_residues_maxneighbor_10_cutoff_30.0_pocketCut_8.0/cv_0/label_frac_1.0_train.pkl
Num of pairs in train = 39901
Killed
Could you help me solve this problem?
Thanks!
Hi, thanks for the great work!! I have a question regarding the following point in the paper:
On p.7 it is stated that:
we unfortunately do not know the actual alignment between points in $Y_l$ and $P_l$, for every $l \in \{1, 2\}$. This can be recovered using an additional optimal transport loss
However, in the code here :
https://github.com/octavian-ganea/equidock_public/blob/main/src/train.py#L128
The optimal transport matrix (the 2nd returned variable) is ignored:
ot_dist, _ = compute_ot_emd(cost_mat_ligand + cost_mat_receptor, args['device'])
In my understanding, this matrix should be used to recover the alignment.
So I am confused: how can the point alignment be recovered without the optimal transport matrix?
Thank you so much again!
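On the alignment point: since the marginals in this loss are uniform over equally many points, the exact-EMD plan is a permutation matrix scaled by 1/n, so a hard alignment can be read off from the plan (or, equivalently, computed with a Hungarian solver). A generic sketch of that equivalence, not the repo's compute_ot_emd (SciPy is used here purely for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def recover_alignment(cost):
    """Hard alignment from an n x n cost matrix with uniform marginals.
    For uniform weights over equally many points, the exact-EMD transport
    plan is (1/n) times a permutation matrix (Birkhoff's theorem), so the
    Hungarian solver recovers the same alignment the OT plan encodes."""
    n = cost.shape[0]
    rows, cols = linear_sum_assignment(cost)
    T = np.zeros_like(cost, dtype=float)
    T[rows, cols] = 1.0 / n                   # the corresponding OT plan
    ot_cost = float(cost[rows, cols].mean())  # equals sum(T * cost)
    return cols, T, ot_cost
```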
dgl version 0.7.0 is no longer available for installation (and neither is the pinned POT version). Could you provide an updated set of compatible library versions?
Thank you for this great tool.
I started installing it on Ubuntu 20.04 and ran into multiple FileNotFoundError messages. Are there any dependencies or setup steps not listed?
Here are the errors:
(base) nc1@nc1-UA9C-R38:~/equidock_public-main$ # Extract the raw PDB files:
(base) nc1@nc1-UA9C-R38:~/equidock_public-main$ python3 project/datasets/builder/extract_raw_pdb_gz_archives.py project/datasets/DIPS/raw/pdb
python3: can't open file '/home/nc1/equidock_public-main/project/datasets/builder/extract_raw_pdb_gz_archives.py': [Errno 2] No such file or directory
(base) nc1@nc1-UA9C-R38:~/equidock_public-main$ # Process the raw PDB data into associated pair files:
(base) nc1@nc1-UA9C-R38:~/equidock_public-main$ python3 project/datasets/builder/make_dataset.py project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim --num_cpus 28 --source_type rcsb --bound
python3: can't open file '/home/nc1/equidock_public-main/project/datasets/builder/make_dataset.py': [Errno 2] No such file or directory
(base) nc1@nc1-UA9C-R38:~/equidock_public-main$ # Apply additional filtering criteria:
(base) nc1@nc1-UA9C-R38:~/equidock_public-main$ python3 project/datasets/builder/prune_pairs.py project/datasets/DIPS/interim/pairs project/datasets/DIPS/filters project/datasets/DIPS/interim/pairs-pruned --num_cpus 28
python3: can't open file '/home/nc1/equidock_public-main/project/datasets/builder/prune_pairs.py': [Errno 2] No such file or directory
Hello!
I was working with your code and noticed that the best-validation metric used in the project (val_complex_rmsd_median) differs from the one stated in the article presenting EquiDock (val_ligand_rmsd_median). Is there a reason behind this choice, or am I misinterpreting something?
Line 372 in ac2c754
python==3.9.10
numpy==1.22.1
cuda==10.1
torch==1.10.2
dgl==0.7.0
biopandas==0.2.8
ot==0.7.0
rdkit==2021.09.4
dgllife==0.2.8
joblib==1.1.0
Shouldn't 'ot' be 'POT', for 'Python Optimal Transport'?
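(For reference: on PyPI the package is indeed published as POT, while the import name in code is ot. Under that assumption, the version list above as a requirements.txt sketch; the pip package names marked below are best-effort guesses and may need adjusting:)

```
numpy==1.22.1
torch==1.10.2
dgl==0.7.0            # may require a CUDA-matched build, e.g. dgl-cu101
biopandas==0.2.8
POT==0.7.0            # imported in code as `ot`
rdkit-pypi==2021.9.4  # pip wheel name for RDKit at that time (assumption)
dgllife==0.2.8
joblib==1.1.0
```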
Great work, but I don't see any documentation for the inference script, or even an ArgumentParser. I think I can figure it out from the code, but it would be nice to have a simple docking-inference entry point to use for predictions. Apologies if this information is somewhere else and I missed it.
Hi,
I am having some issues understanding test_sets_pdb: the original (undocked) structures are in db5_test_random_transform, with the ligands (the part that moves) in random_transformed, the receptors (not movable) in complexes, and the results in db5_equidock_results.
So, for example, if we take from db5_equidock_results the pose 1AVX_l_b_EQUIDOCK.pdb, this means that 1AVX_l_b.pdb was used as the ligand (movable) and 1AVX_r_b_complex.pdb as the receptor (not moved). If I superimpose 1AVX_l_b_EQUIDOCK.pdb and 1AVX_r_b_complex.pdb in PyMOL, I should therefore get a nicely docked complex; however, this is not the case. There are many, many clashes.
Can you help?
Best,
Liviu
Hello,
When I run the command: python preprocess_raw_data.py -n_jobs 20 -data db5 -graph_nodes residues -graph_cutoff 30 -graph_max_neighbor 10 -graph_residue_loc_is_alphaC -pocket_cutoff 8
I get the following error:
Processing split 1
Processing ./cache/db5_residues_maxneighbor_10_cutoff_30.0_pocketCut_8.0/cv_1\label_val.pkl
Traceback (most recent call last):
File "C:\Users\equidock_public-main\src\preprocess_raw_data.py", line 37, in
Unbound_Bound_Data(args, reload_mode='val', load_from_cache=False, raw_data_path=raw_data_path,
File "C:\Users\equidock_public-main\src\utils\db5_data.py", line 78, in init
with open(os.path.join(split_files_path, reload_mode + '.txt'), 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/benchmark5.5/cv/cv_0\cv_1\val.txt'
Any suggestions?
Thanks
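A guess at the root cause (an assumption inferred from the mixed separators and the doubled cv_0\cv_1 in the error path): a hardcoded forward-slash base path is being combined with os.path.join, which uses backslashes on Windows. pathlib makes the join portable; the helper and path names below are hypothetical, for illustration only:

```python
from pathlib import Path

def split_file(split_root, cv_split, mode):
    # Hypothetical helper: build the split-file path portably.
    # Path inserts the correct separator on both Windows and Linux,
    # avoiding mixed strings like './a/b\\c' that os.path.join can
    # produce when a hardcoded '/'-separated prefix is joined on Windows.
    return Path(split_root) / f"cv_{cv_split}" / f"{mode}.txt"
```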
After reading your paper, I have a question: what is the difference between rigid protein docking and protein-protein docking? In my understanding they are almost the same task. Is that right?
Add -W (--whole-file) to skip rsync's delta-transfer check:
rsync -rlpt -v -W -z --delete --port=33444 rsync.rcsb.org::ftp_data/biounit/coordinates/divided/ ./DIPS/raw/pdb
Hope it helps!
It's me again; I have another question.
I am trying to reproduce results from your test_sets_pdb, but in the folder test_sets_pdb/db5_equidock_results I do not see the docked complex structures. Where are they?
Also, in the folder db5_test_random_transformed there is a subfolder called complexes. If we are docking, why do we need the complexes at all?
Your help is greatly appreciated.
Although we have never met, your work has led me into new territory. I was shocked and saddened by the news. You have my lasting respect and gratitude.
Hello !
It is a bit unclear to me which hyperparameters you used to train your models. Could you provide the complete configuration of your best models for DIPS and DB5? In particular, I am not sure whether node and edge features were used. Moreover, the hyperparameters mentioned in the paper do not match those in your best models' checkpoints.
Thanks :)
As we can see in DIPS-Plus, the author mentioned in issue "about make dips dataset" (BioinfoMachineLearning/DIPS-Plus#7) a
"deadlock of sorts after a certain number of complexes have been processed."
When run for a long time it appears intermittently, but on a small number of files it always succeeds, so I split up the make_dataset step.
First, make several separate folders, e.g. tmp1, tmp2, ...:
mkdir tmp1
Then cd into pdb and move batches of files into the new folders:
mv $(ls | head -200) ../tmp6
Finally, run make_dataset.py on each batch separately:
python3 make_dataset.py project/datasets/DIPS/raw/tmp1 project/datasets/DIPS/interim --num_cpus 24 --source_type rcsb --bound
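The manual batching above can be sketched as a small shell function (directory names like tmp1 follow the poster's scheme; the batch size of 200 is illustrative):

```shell
# Move the files of SRC_DIR into numbered batch directories
# (PREFIX1, PREFIX2, ...) of at most BATCH_SIZE files each, so that
# make_dataset.py can be run on each batch separately.
split_into_batches() {
  src=$1; prefix=$2; size=$3
  i=1; count=0
  mkdir -p "${prefix}${i}"
  for f in "$src"/*; do
    mv "$f" "${prefix}${i}/"
    count=$((count + 1))
    if [ "$count" -ge "$size" ]; then
      i=$((i + 1)); mkdir -p "${prefix}${i}"; count=0
    fi
  done
}

# Example (hypothetical layout): split_into_batches pdb ../tmp 200
```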
Sorry to bother you, but could you provide the preprocessed DIPS dataset? Downloading DIPS is very slow for me.
Hi 👋! In the paper it is mentioned: "For DIPS, the split is based on protein family to separate similar proteins". Is there source code for this split? I could only find a random split in paritition_dips.py.
Hi @octavian-ganea ,
Thank you for your great work!
I have a question about Figure 12. Could you tell me how Figure 12 was produced?
In fact, I tried to reproduce it as follows.
First, I took the bound and unbound data from the folder data/benchmark5.5/structures/.
Then, I calculated the CRMSD and IRMSD of the bound and unbound structures following the code in eval_pdb_outputset.py, using the unbound coordinates as the 'pred_coord' and the bound ones as the 'ground_truth_coord'. However, for most pairs the number of residues in the bound and corresponding unbound structures is not equal (to be exact, 179 ligands and 28 receptors did not match). How did you match the corresponding bound and unbound structures? Or did I make a mistake somewhere?
Looking forward to your reply! Thanks!
Hi, I created a virtual environment and tried to pip install the dependencies listed in README.md. However, I'm not able to install some of them (e.g. cuda & dgl==0.7.0). Could you provide a requirements.txt for installing the dependencies?
Thank you! :)
Hi @AxelGiottonini, when I run the code with the command CUDA_VISIBLE_DEVICES=0 python -m src.train -hyper_search, I get the following results:
[2023-03-09 05:47:00.038149] [FINAL TEST for dips] --> epoch -1/10000 || mean/median complex rmsd 16.2906 / 15.7649 || mean/median ligand rmsd 35.9814 / 33.6197 || mean/median sqrt pocket OT loss 28.6057 || intersection loss 21.3417 || mean/median receptor rmsd 0.0000 / 0.0000
[2023-03-09 09:37:52.987988] [FINAL TEST for db5] --> epoch -1/10000 || mean/median complex rmsd 16.7756 / 16.5510 || mean/median ligand rmsd 40.3175 / 36.7189 || mean/median sqrt pocket OT loss 31.0876 || intersection loss 28.0045 || mean/median receptor rmsd 0.0000 / 0.0000
From these results, the mean complex RMSD on the DIPS test set is 16.29 and on the DB5 test set 16.77, but Table 1 of the paper reports 14.52 for DIPS and 14.72 for DB5, i.e. my runs perform worse than reported in the original paper.
What is wrong?
And how can I compute the interface RMSD?
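On the interface RMSD question: a common (CAPRI-style) definition restricts the RMSD to interface residues, i.e. ligand residues whose ground-truth position lies within some cutoff (often 8-10 Å) of the receptor. A hedged NumPy sketch of that idea, not EquiDock's evaluation code (CAPRI's I-RMSD additionally superimposes the interface backbones first, which is omitted here):

```python
import numpy as np

def interface_rmsd(pred_lig, true_lig, true_rec, cutoff=8.0):
    """RMSD over ligand points (e.g. C-alpha coordinates, shape (n, 3))
    whose ground-truth position lies within `cutoff` of any receptor point."""
    # pairwise distances between true ligand and receptor points
    d = np.linalg.norm(true_lig[:, None, :] - true_rec[None, :, :], axis=-1)
    iface = d.min(axis=1) < cutoff   # mask of interface ligand points
    diff = pred_lig[iface] - true_lig[iface]
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```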
The code does not work. There is no documentation. Please provide a working example.