drorlab / atom3d
ATOM3D: tasks on molecules in three dimensions
Home Page: https://www.atom3d.ai
License: MIT License
Hello!
Are any of the trained models (model parameters/weights) available? For instance the 3D CNN trained on the residue deletion task.
Thanks.
I'd like to know what batch size and number of epochs were used to produce the GCN result on LBA in the paper. In train_pdbbind.py, the batch size is set to 1 and the number of epochs to 100. However, the test RMSE I got with this setting is more than ten times higher than the one reported in the paper.
Hi, thanks for this amazing and comprehensive work!
After downloading the full dataset from https://www.atom3d.ai/pip.html, I loaded the DB5 dataset and printed its length, which is 0. The DIPS part of the data loads correctly.
First of all, thank you so much for your great repo!
I think you are changing the data loading code, and there were some conflicts. If you could provide some demos, that would be a great help.
Many thanks
Hi, in the Example section of this page, https://atom3d.readthedocs.io/en/latest/using_datasets.html,
when demonstrating how to extract all atoms within 5 Angstroms of a ligand,
you have the following code:
`
lig_coords = fo.get_coordinates_from_df(atoms_df[atoms_df['subunit']=='LIG']) # get coords of ligand
df_filtered = distance_filter(atoms_df, lig_coords, dist=5.0)`
But I find that the data type of subunit is int64, so the comparison with 'LIG' in this line looks wrong to me. Is there something I am missing?
Thanks in advance!
Dear Authors,
Thanks for your great work.
I tried to use this repo, but I ran into some problems running the example script.
The first problem is how to create a dataset from scratch. For example, in the README file, what is "PATH_TO_INPUT_DIR"? Should I download the raw data and put it into PATH_TO_INPUT_DIR, or is it already in the repo?
Then, after I process the dataset, should I run "python data.py" and "python train.py" in the ./example folder directly?
Thanks in advance for your time!
Hi, I have a question about da.load_dataset.
I prepared a new LMDB from PDB files and added the label following the tutorial. However, after loading with da.load_dataset, 'y' is not a single value; instead it is a DataFrame, as shown below. Could you advise me on how to deal with this?
train_dataset = da.load_dataset(PATH_TO_LMDB_OUTPUT, 'lmdb', transform=tr.GraphTransform(atom_key='atoms', label_key='label'))
for i in train_dataset[0]:
    print(i)
('y', label
0 5.72556)
Thus, when running the following loop, batch.y is not in the format of a tensor.
for batch in train_loader:
    print(batch.y)
The way I generated the LMDB is:
import os
import pandas as pd

def add_label(item):
    # Remove the file ending ".pdb" from the ID
    name = item['id'][:-4]
    # Get label data
    label_file = os.path.join(PATH_TO_LABELS_DIR, name+'.csv')
    # Add label data in form of a data frame
    item['label'] = pd.read_csv(label_file)
    return item
## Load dataset from directory of PDB files
dataset = da.load_dataset(PATH_TO_INPUT_DIR, 'pdb', transform=add_label)
# Create LMDB dataset from PDB dataset
da.make_lmdb_dataset(dataset, PATH_TO_LMDB_OUTPUT)
If possible, could you give an example of the content in the csv file for add_label? I'd just like to make sure the format is right.
Thank you in advance for looking into this question.
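One possible reading of the output in the question above: `pd.read_csv` returns a whole DataFrame, so `item['label']` is stored as a table rather than as a number. The sketch below shows one way to pull a single float out of a one-column CSV using only the standard library. The helper name `read_scalar_label` and the CSV layout (a `label` header followed by one value) are assumptions inferred from the printed output, not something the ATOM3D tutorial confirms.

```python
import csv
import io


def read_scalar_label(csv_text, column="label"):
    """Return the first value of `column` as a float (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    first_row = next(reader)          # first data row after the header
    return float(first_row[column])   # plain float instead of a table


# Assumed file content, matching the value printed in the question.
example_csv = "label\n5.72556\n"
label = read_scalar_label(example_csv)
print(label)
```

In `add_label`, the same idea would be to assign the float to `item['label']` instead of the DataFrame. Whether `GraphTransform` then yields a tensor-valued `batch.y` has not been verified here.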
Hi there,
Thank you for providing the code base.
I have some questions on the LEP dataset. It seems that there are multiple values for the chain in atoms; e.g., I have the following 5 chain sets for all atoms:
'L' stands for the ligands, but what about the others? And according to this function, any chain that is not 'L' is treated as being in the pocket, including 'A', 'B', 'C', 'D', 'E', 'G', right?
Just want to double-check to better understand the dataset. Any help is appreciated.
Hi there,
Thank you for providing the code base. I have one question about the LBA dataset.
After downloading, I found that there are 4,463 datapoints under the folder pdbbind_2019-refined-set; however, the PDBBind website shows that the pdbbind_2018 version has the same number of datapoints. So I just want to double-check which version you are using.
In PPI and LBA, how do you specify the coordinates of the two 3D graphs (e.g., the two proteins, or the ligand and the target)?
ModuleNotFoundError: No module named 'atom3d.datasets.ppi'
I followed https://github.com/drorlab/atom3d/blob/master/examples/lba/enn/README.rst and used:
cd atom3d/examples/lba/enn
python train.py --target neglog_aff --load \
    --prefix lba-id30_cutoff-06_maxnumat-600 \
    --datadir $LMDBDIR --format lmdb \
    --cgprod-bounded \
    --radius 6 --maxnum 600 \
    --batch-size 1 --num-epoch 150
My terminal and log file show "Beginning training on CUDA/GPU! Device: 0", but nvidia-smi shows "No running processes found".
The same happened when I tried the LEP example. However, in the same virtual environment, I ran python examples/train_qm9.py from https://github.com/drorlab/cormorant, and it worked well.
Hi @drorlab,
Thank you for contributing Atom3D. I would like to know where we can find the benchmark information for these datasets. Are you going to maintain a benchmark like OGB?
Thank you!
Hi, thank you for your efforts towards a common resource for structural biology benchmarks. I am currently working with the PDBBind dataset and wanted to test my model on the splits you provided. From the link for the dataset (https://www.atom3d.ai/lba.html), I was able to find the identity_60 splits, but could not find the identity_30 splits.
Could you please link me to them?
Hello.
I'm running an SMP experiment, and I found that the edge feature dimension is 4.
However, I couldn't figure out what each element stands for.
Could you explain what the edge features of a molecule represent?
Thank you.
If I want to prepare a new dataset for the LBA task, how should I process my data?
I have prepared protein PDB files with corresponding ligand SDF files for this task.
I see that atom3d/atom3d/datasets/lba/process_pdbbind.py generates three files in the formats (sdf, cif, cif), but prepare_lmdb.py expects files in the formats (sdf, pdb, pdb), respectively.
As I understand it, I should first run process_pdbbind.py on my dataset to generate the (ligand, pocket, protein) file set, and then use prepare_lmdb.py to generate the final hdf5 file for loading with LMDBDataset. But it is clear that this does not currently work.
There are also several other scripts, such as create_hdf5.py, that seem to offer another way to generate the hdf5/LMDB file.
So what exactly are the steps to prepare the data?
I find preparing a new dataset and running a specific task with atom3d very confusing.
Hi, I'm unable to reproduce the results quoted in the paper for the performance achieved by the baseline GNN model on the LBA dataset with 30% sequence identity split. I'm following atom3d/examples/lba/gnn to download the dataset and train the model with hyperparameters given in the README. Over 6 runs with different seeds, I'm getting 1.58 +- 0.04 for test RMSE, 0.53 +- 0.03 for test Pearson and 0.54 +- 0.04 for test Spearman. Only the RMSE is consistent with what's reported in the paper. Can the authors confirm that the hyperparameters and code given in atom3d/examples/lba/gnn are identical to what's used to produce the results in the paper? Thanks!
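When comparing numbers like those in the question above, it can help to pin down exactly how each metric is computed. The following is a dependency-free sketch of RMSE, Pearson, and Spearman correlation. It is not the evaluation code from the repository or the paper; the function names and the toy values are made up for illustration, and ties in the Spearman ranks are handled by average ranking, which is an assumption about the intended definition.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy values, not taken from any ATOM3D dataset.
y_true = [4.0, 5.5, 6.2, 7.1]
y_pred = [4.5, 5.0, 6.0, 7.5]
print(rmse(y_true, y_pred), pearson(y_true, y_pred), spearman(y_true, y_pred))
```

Differences such as computing a correlation over the whole test set versus averaging it per target can change the reported value, so it is worth checking which convention each number uses.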
I cannot find any lmdb files from the LBA dataset. I downloaded the dataset from the link posted here, and it only contained pdbbind_3dcnn.h5 and some txt files. I wonder how I can access the lmdb dataset.
I was trying to make a lmdb file from an xyz file through atom3d.datasets, but I keep receiving this error:
  File "$DIR/python3.7/site-packages/atom3d/util/formats.py", line 128, in df_to_bps
    atom['element'])
  File "$DIR/python3.7/site-packages/Bio/PDB/Atom.py", line 96, in __init__
    assert not element or element == element.upper(), element
AssertionError: Si
The first line of my xyz file is: Si -0.04553300 -1.14436300 0.73426900
But I assume that Si is a proper element. How can I fix this issue?
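The assertion in the traceback above, `element == element.upper()`, suggests that Biopython expects element symbols in upper case, so `Si` fails where `SI` would pass. A minimal sketch of normalizing the symbols before the conversion step follows. The helper `uppercase_elements` and the list-of-dicts layout are hypothetical stand-ins for the atoms table; whether atom3d offers a hook at this point is not confirmed.

```python
def uppercase_elements(atoms):
    """Return copies of the atom records with upper-cased element symbols."""
    return [{**atom, "element": atom["element"].upper()} for atom in atoms]


# One record built from the first line of the xyz file in the question.
atoms = [{"element": "Si", "x": -0.04553300, "y": -1.14436300, "z": 0.73426900}]
fixed = uppercase_elements(atoms)
print(fixed[0]["element"])  # SI
```

Applying the change to the parsed table rather than to the xyz file itself avoids confusing a file reader that may expect the usual mixed-case symbols.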
Hi,
prot_df_to_graph and mol_df_to_graph result in graphs with connections that go only one way, which is due to the output of scipy's query_pairs. A possible fix would be to use:
edges = torch.cat((edges, edges.flip(dims=(0,))), dim=1)
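The effect of the proposed one-liner can be illustrated without PyTorch. Assuming `edges` holds two rows (source indices and target indices), flipping the rows swaps the direction of every edge, and concatenating along the second axis appends those reversed copies. The function name `make_bidirectional` is hypothetical.

```python
def make_bidirectional(edges):
    """edges is [sources, targets]; return the edge list with both directions."""
    sources, targets = edges
    # Row 0: original sources followed by original targets.
    # Row 1: original targets followed by original sources.
    return [sources + targets, targets + sources]


# Three one-way edges: 0->1, 0->2, 1->2
edges = [[0, 0, 1], [1, 2, 2]]
both = make_bidirectional(edges)
print(both)  # [[0, 0, 1, 1, 2, 2], [1, 2, 2, 0, 0, 1]]
```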
Hello.
First off, I want to say that it is a sight for sore eyes seeing a repository as well documented as this. I am very impressed given the scope of it.
On another note, I did encounter an odd ModuleNotFoundError when trying to run the train_ppi.py script locally in the provided "geometric" Conda environment:
:665: in _load_unlocked
???
../../../../../anaconda3/envs/geometric/lib/python3.6/site-packages/_pytest/assertion/rewrite.py:170: in exec_module
exec(co, module.__dict__)
train_ppi.py:11: in
import ppi_dataloader as dl
ppi_dataloader.py:15: in
import atom3d.torch.graph as gr
E ModuleNotFoundError: No module named 'atom3d.torch'
It appears as though a directory (atom3d.torch) did not get added to version control and, as such, is not appearing remotely. Was this omission intentional?
Thank you for your time!
When I try to pull the res dataset I get:
>>> da.download_dataset(out_path="/path/to/data/atom3d-data", name="res", split="cath-topology")
--2022-02-05 12:50:33-- http://1rjeayyofn0y6pgnljyg0fy5fkqaopoqc/
Resolving 1rjeayyofn0y6pgnljyg0fy5fkqaopoqc (1rjeayyofn0y6pgnljyg0fy5fkqaopoqc)... failed: Name or service not known.
wget: unable to resolve host address ‘1rjeayyofn0y6pgnljyg0fy5fkqaopoqc’
gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now
This breaks with or without splits. Pulling the link from the file and checking it directly shows that it is also broken.
Hello, thank you for providing such a great dataset collection to the community. I have recently been working on the PSR dataset and noticed some possible improvements, mainly regarding maintainability and additional information.
Again, thank you for sharing your great work.
For the LBA problem, how can I tune each of the models (cnn3d, enn, and gnn) if I use my own data? How can I evaluate the models and make predictions? If you could provide examples, that would be a great help. Many thanks.
I was hoping to recreate the ppi benchmarking example for GNNs. It seems like the ppi_dataloader.py and most of the non QM9 data loaders import an atom3d.shard package which no longer exists. What is the best way to recreate the ppi experiment for GNNs?
Thank you!
Hi, dear atom3d,
I successfully used the LBA dataset with the 30% identity split. However, there is a serious discrepancy for the dataset with the 60% identity split. To be explicit, according to the paper, the 60% identity split leads to training, validation, and test sets of sizes 3678, 460, and 460, respectively. However, there are only 3563 samples in the training set and 452 samples in the test set.
Can you please take a look at the split settings again and see whether there was a mistake?
Best,
Thanks for the nice benchmark!
I am wondering how we could prepare data for the 3D CNN. Is there any function that generates the density for it?
Dear Authors,
I saw that you provide examples for SMP, which is super helpful. I'm studying LBA and wonder how ENN models LBA, especially how ENN represents proteins. Will you release the example code soon? Thanks
Hi,
I was really intrigued by this work when I heard the presentation at NeurIPS and wanted to try setting it up by running the pytorch-geometric GNN on PDBBind, but wasn't able to due to various issues. Could you please add some documentation on how to download the dataset, how to generate splits and how to train a model? Various functions seemed to have been moved or renamed (maybe some tests would help to ensure this doesn't happen?).
I think most of the pieces are there and this project has a lot of potential, so I'm looking forward to coming back to it.
I am trying to download the RES dataset using atom3d but the url is incorrect. Following the documentation, I run the following code:
import atom3d.datasets.datasets as da
da.download_dataset('res', 'data')
However, I get the following error message:
--2021-08-03 15:59:02-- http://1nmsnqayokof9-76l4gyqvodsehnzlxv7/
Resolving 1nmsnqayokof9-76l4gyqvodsehnzlxv7 (1nmsnqayokof9-76l4gyqvodsehnzlxv7)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘1nmsnqayokof9-76l4gyqvodsehnzlxv7’
gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now
When I looked at the source code for the download_dataset() function, I noticed that all the other datasets have URLs of the form https://zenodo.org/record/..., whereas the RES dataset tries to download from 1nmsnqayokof9-76l4gyqvodsehnzlxv7/, which is not a proper URL. I believe this is the problem.
When you get a moment, can you please fix this and provide the correct URL?
Also, this is my first time opening an issue so if more information or context is needed, please let me know!
Traceback (most recent call last):
  File "data_TL.py", line 4, in <module>
    from rdkit import Chem
  File "/home/panfulu/anaconda3/envs/atom3d/lib/python3.6/site-packages/rdkit/Chem/__init__.py", line 18, in <module>
    from rdkit import DataStructs
  File "/home/panfulu/anaconda3/envs/atom3d/lib/python3.6/site-packages/rdkit/DataStructs/__init__.py", line 13, in <module>
    from rdkit.DataStructs import cDataStructs
ImportError: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.11' not found (required by /home/panfulu/anaconda3/envs/atom3d/lib/python3.6/site-packages/rdkit/DataStructs/../../../../libRDKitDataStructs.so.1)
When I use rdkit 2020.09, I get the error above. Which version of rdkit should be used?
Dear Authors,
I use the LBA model and tried to call "model = model.to(device)", but I got the error
"device, dtype, non_blocking = torch._C._nn._parse_to(*args, **kwargs)
ValueError: too many values to unpack (expected 3)"
Do you know how to change the device (e.g., CPU to GPU) for the CGModule model?
Thanks a lot!
I'm using the recently uploaded LBA dataset in LMDB format. I found that in many examples there is more than one string in the 'seq' attribute, which I understand to be the amino acids sequence of the protein in the complex. Can you explain why there can be multiple sequences for one protein?
Hi there, I have some questions about the data splitting on LBA.
I'm not sure if these three indices txt files are already generated in the previous steps (https://github.com/drorlab/atom3d/blob/master/examples/lba/dataset/prepare_lmdb.py#L156-L161).
Follow-up:
I checked the code base again and found the identity split function, but I haven't found the scripts for running it. The remaining question is how to set the blast_db variable.
Dear Authors,
I ran into problems when preparing the input for ENN training on SMP, following your example. Could you share the preprocessing script? I have read your fantastic documentation, but I still need help with preprocessing for ENN and SMP.
In addition, do you run "python train.py" directly in bash?
Much appreciated!
I ran the command as described in README.md and encountered the following problem.
The whole log is below:
(env) [ pytorch_geometric]$ python train_qm9.py --target 7 --prefix qm9-u0
Traceback (most recent call last):
  File "train_qm9.py", line 10, in <module>
    from data_qm9_for_ptgeom import GraphQM9
ModuleNotFoundError: No module named 'data_qm9_for_ptgeom'
Hi, dear authors of atom3d, thanks for providing the data collection. I encounter a problem in understanding the LEP dataset.
If my understanding is correct, each data point has an 'atoms_active' and an 'atoms_inactive'. These two correspond to two different protein-ligand pairs, with one positive label and one negative label. However, there is also another key named 'label'. It takes two sorts of values: A or I.
I guess A stands for active and I stands for inactive? This seems contradictory to me: what is this 'label' used for?
I used
python train.py \
    --data_dir data/split-by-sequence-identity-30/data/ \
    --mode train \
    --batch_size 16 \
    --num_epochs 50 \
    --learning_rate 1e-4
(very similar to the gnn and enn model training method)
but I got:
Traceback (most recent call last):
  File "train.py", line 173, in <module>
    parser.add_argument('--data_dir', type=str, default=os.environ['LBA_DATA'])
  File "/home/xzhang/miniconda3/envs/atom3d/lib/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'LBA_DATA'
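The traceback above shows that `os.environ['LBA_DATA']` is evaluated while the parser is being built, so the KeyError is raised even when `--data_dir` is given on the command line. A minimal sketch of one workaround, using `os.environ.get` so that a missing variable yields `None` instead of an exception, follows. The `build_parser` wrapper is a hypothetical name, and this is not necessarily how the repository resolves the issue; exporting `LBA_DATA` in the shell before running is the other obvious option.

```python
import argparse
import os


def build_parser():
    """Build a parser whose default does not require LBA_DATA to be set."""
    parser = argparse.ArgumentParser()
    # .get() returns None instead of raising KeyError when LBA_DATA is unset.
    parser.add_argument('--data_dir', type=str,
                        default=os.environ.get('LBA_DATA'))
    return parser


# Explicit argument taken from the command in the question.
args = build_parser().parse_args(
    ['--data_dir', 'data/split-by-sequence-identity-30/data/'])
print(args.data_dir)
```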
I'd like to load the LBA dataset and get the following:
I downloaded the LBA dataset and loaded it in python with:
# Load dataset from directory of PDB files
dataset = da.load_dataset("ligand_binding_affinity", 'pdb')
But how can I get the amino acid sequences, SMILES strings, and binding affinity from this?
Thanks for your time.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xde in position 16: invalid continuation byte
Hello! I was wondering whether you would consider providing your datasets in standard formats for ease of integration into existing frameworks, outside of your API + LMDB files. In particular, I'm looking to evaluate on the LEP dataset with code that operates on PDBbind-like structures (i.e. raw PDB file for protein, mol2/sdf for ligand). Let me know, and thank you!
Hi, as clarified in the ATOM3D paper, the metric for the LBA dataset is '-log(K)'. Do we need to compute the negative log ourselves, or is the value of item['scores']['neglog_aff'] already preprocessed?
Besides, I notice there are two types of protein structures: the original one and the pocket. Can I use the pocket-ligand pair to predict the binding affinity, since the pocket contains far fewer atoms?
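As a worked illustration of the '-log(K)' transform mentioned above, the sketch below converts a binding constant to its negative logarithm. It assumes a base-10 logarithm and K expressed in molar units, following the common pKd/pKi convention; neither assumption is confirmed by the text here. The key name `neglog_aff` suggests the stored value may already be transformed, but that is also not verified.

```python
import math


def neglog_affinity(k_molar):
    """Return -log10(K) for a binding constant K in mol/L (assumed convention)."""
    return -math.log10(k_molar)


# A 1 micromolar binder maps to a value of about 6 under this convention.
print(neglog_affinity(1e-6))
```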
When I trained the GNN model provided in examples/psr/gnn/model.py using the hyperparameters specified in examples/psr/gnn, I'm getting a per-target Spearman's rho of 0.503 +/- 0.013 on the validation set, about what is reported in the paper. However, I'm only getting a per-target Spearman's rho of 0.405 +/- 0.013 on the test set. For reference, I'm seeing 0.582 +/- 0.007 for the per-target Spearman on the training data after 50 epochs.
Hi there,
Thank you for the nice opensource datasets and useful util functions in Atom3D.
I have some questions about how to acquire the protein sequence in the LBA datasets. After downloading the dataset from Zenodo, I found that the key 'seq' in each item is an empty list. I also tried get_chain_sequences but also got an empty list. I wonder how I can get the AA sequence for the LBA datasets.
It seems we are trying to import from util instead of from atom3d.util:
Traceback (most recent call last):
  File "atom3d/datasets/lba/create_hdf5.py", line 11, in <module>
    from util import datatypes as dt
ImportError: cannot import name 'datatypes'
I changed the imports to:
import atom3d.util.formats as dt
import atom3d.util.file as fi
Thanks!
Allan
Thanks for your really cool work and sharing the repository - your neurips paper's results looked very interesting. I am trying to retrain the 3dcnn model on the residue deletion task using the data splits that you have provided. After some changes to the code, I am stuck at the data loader. The data files you provide look like:
data/residue_deletion/split/train_envs_0000_1000.h5
data/residue_deletion/split/train_envs_0001_1000.h5
Whereas the dataloader in benchmarking/cnn3d/train_resdel.py currently expects the format in ResDel_Dataset_PT to be the following:
data = torch.load(os.path.join(self.path, f'data_t_{idx}.pt'))
Can you please provide information on how to convert the h5 files to the format expected by the data loader? I looked at atom3d/datasets/res/convert_resdel_from_hdf5_to_npz.py, but it doesn't look like the right conversion.
Any pointers to get the data loading to work would greatly help.
Thanks,
Meghana.
When I use ENN on LBA, it seems that "cgprod_bounded" is not defined in argparse, and I got some errors. Thanks for your time!
I was wondering if it is possible to use the ATOM3D package to predict the properties of a reaction, where the input data could be the structures of the reactant and product molecules, and the output is a property of the reaction itself.
Thanks for the great work on this package!
I downloaded the LMDB dataset for residue deletion which unzipped to the following folder:
/raw/RES/data/
data.mdb lock.mdb
When I look at the dataframes for each protein structure, the labels are missing.
dataset = da.load_dataset(lmdb_path, 'lmdb')
dataset.get('100d')
{'atoms': ensemble subunit structure model chain hetero insertion_code ... x y z element name fullname serial_number
0 100d.pdb 0 100d.pdb 0 A ... -4.549 5.095 4.262 O O5' O5' 1
1 100d.pdb 0 100d.pdb 0 A ... -4.176 6.323 3.646 C C5' C5' 2
[408 rows x 20 columns], 'id': '100d', 'file_path': '/oak/stanford/groups/rbaltman/aderry/graph-pdb/data/raw/100d.pdb', 'labels': Empty DataFrame
Columns: [subunit, label, x, y, z]
Index: [], 'subunit_indices': [], 'types': {'atoms': "<class 'pandas.core.frame.DataFrame'>", 'id': "<class 'str'>", 'file_path': "<class 'str'>", 'labels': "<class 'pandas.core.frame.DataFrame'>", 'subunit_indices': "<class 'list'>", 'types': "<class 'dict'>"}}
Is the idea that one downloads this slightly reformatted PDB data and then runs some feature generation code (ex: generate voxels for 3D CNN) on top of it? Can you please point to the code that can do this (the current code in this repo still seems to use shards and not the lmdb format)?
Thanks.
Hi, I downloaded and unzipped the dataset from https://zenodo.org/record/4914718#.YPFQNegzZPY. However, I found no LMDB file, which I need in order to use the following code:
from atom3d.datasets import LMDBDataset
dataset = LMDBDataset(PATH_TO_LMDB)
print(len(dataset)) # Print length
print(dataset[0]) # Print 1st entry
labels = [item['scores']['neglog_aff'] for item in dataset] # Get all labels
Can you give me guidance on how to load the LBA dataset correctly?