
atom3d's People

Contributors

awfderry, everyday847, martinvoegele, psuriana, raphtown, yiannil


atom3d's Issues

Trained models available?

Hello!
Are any of the trained models (model parameters/weights) available? For instance, the 3D CNN trained on the residue deletion task.

Thanks.

Hyperparameters to reproduce the GCN performance on LBA

I'd like to know what batch size and number of epochs were used to produce the GCN result on LBA in the paper. In train_pdbbind.py, the batch size is set to 1 and the number of epochs to 100. However, the test RMSE I got with this setting was more than ten times higher than the one reported in the paper.

I met some difficulties installing and using your library

First of all, thank you so much for your great repo!

  1. Could you provide an installation tutorial? When I followed your instructions, I found I also had to install pyrr and torch-geometric.
  2. I couldn't find atom3d.models.ff when I ran from atom3d.models.ff import FeedForward.
  3. When I load a dataset using dataset = da.load_dataset('data/test_lmdb', 'lmdb', transform=tr.graph_transform), I get: module 'atom3d.util.transforms' has no attribute 'graph_transform' (see the sketch at the end of this issue).
  4. Which directory should I put the downloaded dataset into?
  5. Could you add a README.rst to /examples/lba/cnn3d like the ones in /enn and /gnn?

I think you are in the middle of changing the data loading code, so there are some conflicts. If you could provide some demos, that would be a great help.
Many thanks!
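For reference, a minimal sketch of a load call using the class-based transform that appears in later issues on this page (the LMDB path and label key are placeholders, and the exact keyword names should be checked against your installed version):

import atom3d.datasets.datasets as da
import atom3d.util.transforms as tr

# tr.GraphTransform (a class) is used in other issues below; a lowercase
# tr.graph_transform function does not appear to exist in this version.
dataset = da.load_dataset('data/test_lmdb', 'lmdb',
                          transform=tr.GraphTransform(atom_key='atoms',
                                                      label_key='label'))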

question on documentation

Hi, in the Examples section of this page: https://atom3d.readthedocs.io/en/latest/using_datasets.html

When demonstrating how to extract all atoms within 5 Angstroms of a ligand, you have the following code:

lig_coords = fo.get_coordinates_from_df(atoms_df[atoms_df['subunit']=='LIG']) # get coords of ligand
df_filtered = distance_filter(atoms_df, lig_coords, dist=5.0)

But I find that the data type of the subunit column is int64, so this line of code looks wrong to me. Is there something I'm missing?

Thanks in advance!
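For reference, a quick way to check what the subunit column actually holds before filtering (a minimal sketch; atoms_df is the dataframe from the documentation example, and the resname-based fallback is an assumption about the column names, not confirmed API):

# Inspect the subunit column before filtering.
print(atoms_df['subunit'].dtype)     # reported above as int64, not strings
print(atoms_df['subunit'].unique())  # which subunit IDs actually occur

# If subunits are integers, the ligand atoms may instead be identifiable
# by residue name (column name assumed; check atoms_df.columns).
lig_df = atoms_df[atoms_df['resname'] == 'LIG']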

Guidance of running example

Dear Authors,

Thanks for your great work.

I tried to use this repo, but I ran into some problems running the example scripts.

The first problem is how to create a dataset from scratch. For example, in the README file, what is "PATH_TO_INPUT_DIR"? Should I download the raw data and put it into PATH_TO_INPUT_DIR, or is it already in the repo?

Then, after I process the dataset, should I run "python data.py" and "python train.py" directly in the ./examples folder?

Thanks in advance for your time!

issue about tr.GraphTransform

Hi, I have a question about da.load_dataset.

I prepared a new LMDB dataset from PDB files and added the label following the tutorial. However, after reading it back with da.load_dataset, 'y' does not hold a single value; instead it comes back as a DataFrame, see below. Could you advise me on how to deal with this?

train_dataset = da.load_dataset(PATH_TO_LMDB_OUTPUT, 'lmdb',  transform=tr.GraphTransform(atom_key='atoms', label_key='label'))
for i in train_dataset[0]:
    print(i)

('y', label
0 5.72556)

Thus, in the following loop, batch.y does not come out as a tensor:

for batch in train_loader:
    print(batch.y)

The way I generated the LMDB is:

def add_label(item):
    # Remove the file ending ".pdb" from the ID
    name = item['id'][:-4]
    # Get label data
    label_file = os.path.join(PATH_TO_LABELS_DIR, name+'.csv')
    # Add label data in form of a data frame
    item['label'] = pd.read_csv(label_file)
    return item

## Load dataset from directory of PDB files
dataset = da.load_dataset(PATH_TO_INPUT_DIR, 'pdb', transform=add_label)
# Create LMDB dataset from PDB dataset
da.make_lmdb_dataset(dataset, PATH_TO_LMDB_OUTPUT)

If possible, could you give an example of the content of the csv file for add_label? I'd just like to make sure the format is right.
Thank you in advance for looking into this question.
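For reference, one likely fix is to store a scalar instead of a whole DataFrame, so that 'y' can be collated into a tensor (a minimal sketch, assuming the CSV holds a single row with a 'label' column, as the printed output above suggests):

import os
import pandas as pd

def add_label(item):
    # Strip the ".pdb" ending from the ID to find the matching CSV
    name = item['id'][:-4]
    label_file = os.path.join(PATH_TO_LABELS_DIR, name + '.csv')
    # Store a plain float rather than a DataFrame so the loader can
    # batch 'y' into a tensor
    item['label'] = float(pd.read_csv(label_file)['label'].iloc[0])
    return item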

Questions on `chain` in LEP

Hi there,

Thank you for providing the code base.

I have some questions on the LEP dataset. So it seems that there are multiple values for the chain in atoms, e.g. I have the following 5 chain sets for all atoms:

  • {'L', 'A'}
  • {'L', 'D', 'E', 'B', 'G'}
  • {'C', 'A', 'B', 'L'}
  • {'L', 'A', 'B'}
  • {'L', 'D', 'C'}

'L' stands for the ligand, but what about the others? And according to this function, any chain that is not 'L' is treated as part of the pocket, including 'A', 'B', 'C', 'D', 'E', and 'G', right?

Just want to double-check to better understand the dataset. Any help is appreciated.

LBA Dataset Confirmation

Hi there,

Thank you for providing the code base. I have one question about the LBA dataset.

After downloading, I found there are 4,463 datapoints under the folder pdbbind_2019-refined-set; however, the PDBBind website shows that the pdbbind_2018 version has that same number of datapoints. So I just want to double-check: which version are you using?

GPU not used even though torch.cuda.is_available() returns True when using the ENN

I followed https://github.com/drorlab/atom3d/blob/master/examples/lba/enn/README.rst and ran:

cd atom3d/examples/lba/enn
python train.py --target neglog_aff --load \
    --prefix lba-id30_cutoff-06_maxnumat-600 \
    --datadir $LMDBDIR --format lmdb \
    --cgprod-bounded \
    --radius 6 --maxnum 600 \
    --batch-size 1 --num-epoch 150

My terminal and log file show "Beginning training on CUDA/GPU! Device: 0", but nvidia-smi shows no running processes.
The same thing happened when I tried the LEP example. However, in the same virtual env, https://github.com/drorlab/cormorant (python examples/train_qm9.py) worked well.
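For reference, a quick sanity check of where the parameters actually live (a minimal sketch; model stands for the network instance inside train.py):

import torch

# If this prints 'cpu', the model was never moved despite the log message.
print(next(model.parameters()).device)
# Confirm that PyTorch sees the GPU at all.
print(torch.cuda.is_available(), torch.cuda.device_count())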

Benchmark for Atom3D

Hi @drorlab,

Thank you for contributing Atom3D. Where can we find the benchmark results for these datasets? Are you going to maintain a public benchmark just like OGB?
Thank you!

Identity30_splits for PDBBind

Hi, thank you for the efforts towards a common resource for structural biology benchmarks. I am currently working with the PDBBind dataset and wanted to test the model on the splits you provided. From the dataset link (https://www.atom3d.ai/lba.html), I was able to find the identity_60 splits, but could not find the identity_30 splits.

Could you please link me to them?

What does the edge feature stand for?

Hello.

I'm doing an SMP experiment, and I found that the edge feature dimension is 4.

However, I couldn't figure out what each element stands for.

Could you give me some guidance on what the edge features of a molecule represent?

Thank you.
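For reference, the edge features can at least be inspected directly (a minimal sketch; it assumes the GraphTransform used elsewhere in these issues produces a torch_geometric-style Data object with an edge_attr field, and the path and label key are placeholders):

from atom3d.datasets import LMDBDataset
import atom3d.util.transforms as tr

dataset = LMDBDataset(PATH_TO_SMP_LMDB,
                      transform=tr.GraphTransform(atom_key='atoms',
                                                  label_key='labels'))
graph = dataset[0]
print(graph.edge_attr.shape)  # reported above as dimension 4 per edge
print(graph.edge_attr[:5])    # inspect the first few edge feature rows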

Dataset Preparation

If I want to prepare a new dataset for the LBA task, how should I process my data?

I have prepared protein PDB files with corresponding ligand SDF files for this task.

I see that atom3d/atom3d/datasets/lba/process_pdbbind.py generates three files in the formats (sdf, cif, cif), but prepare_lmdb.py needs files in the formats (sdf, pdb, pdb), respectively.

As I understand it, I should first transform my dataset into a (ligand, pocket, protein) file set with process_pdbbind.py and then use prepare_lmdb.py to generate the final LMDB file for loading with LMDBDataset. But clearly I cannot do it that way right now.

There are also multiple scripts, like create_hdf5.py, that seem to offer another way to generate an HDF5/LMDB file.

So how exactly should I prepare the data?

I'm feeling quite confused about preparing a new dataset and running a specific task with atom3d.
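For reference, the general-purpose loaders used elsewhere on this page suggest a route that bypasses the task-specific scripts entirely (a minimal sketch; fo.read_sdf, fo.mol_to_df, and the pairing of same-named .pdb/.sdf files are assumptions to verify against your installed version):

import os
import atom3d.datasets.datasets as da
import atom3d.util.formats as fo

def attach_ligand(item):
    # Pair each protein PDB with a same-named ligand SDF (naming scheme
    # assumed); reader/converter names are assumed from atom3d.util.formats.
    name = item['id'][:-4]
    mol = fo.read_sdf(os.path.join(PATH_TO_SDF_DIR, name + '.sdf'))
    item['atoms_ligand'] = fo.mol_to_df(mol)
    return item

dataset = da.load_dataset(PATH_TO_PDB_DIR, 'pdb', transform=attach_ligand)
da.make_lmdb_dataset(dataset, PATH_TO_LMDB_OUTPUT)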

Trouble reproducing GNN results on LBA using provided code

Hi, I'm unable to reproduce the results quoted in the paper for the performance achieved by the baseline GNN model on the LBA dataset with the 30% sequence identity split. I'm following atom3d/examples/lba/gnn to download the dataset and train the model with hyperparameters given in the README. Over 6 runs with different seeds, I'm getting 1.58 +/- 0.04 for test RMSE, 0.53 +/- 0.03 for test Pearson and 0.54 +/- 0.04 for test Spearman. Only the RMSE is consistent with what's reported in the paper. Can the authors confirm that the hyperparameters and code given in atom3d/examples/lba/gnn are identical to what's used to produce the results in the paper? Thanks!

Cannot find lmdb files in LBA dataset

I cannot find any lmdb files from the LBA dataset. I downloaded the dataset from the link posted here, and it only contained pdbbind_3dcnn.h5 and some txt files. I wonder how I can access the lmdb dataset.

Importing xyz file

I was trying to make an LMDB file from an xyz file through atom3d.datasets, but I keep receiving this error:

File "$DIR/python3.7/site-packages/atom3d/util/formats.py", line 128, in df_to_bps
    atom['element'])
File "$DIR/python3.7/site-packages/Bio/PDB/Atom.py", line 96, in __init__
    assert not element or element == element.upper(), element
AssertionError: Si

The first line of my xyz file is: Si -0.04553300 -1.14436300 0.73426900
But I assume that Si is a proper element symbol. How can I fix this issue?
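For reference, the assertion in Bio.PDB requires uppercase element symbols ('SI' passes, 'Si' does not), so normalizing the element column before conversion should avoid the error (a minimal sketch; the 'xyz' filetype argument and the item['atoms'] key are assumptions to verify):

import atom3d.datasets.datasets as da

def uppercase_elements(item):
    # Bio.PDB's Atom class asserts element == element.upper(), so 'Si'
    # must become 'SI' before df_to_bps builds Bio.PDB atoms.
    item['atoms']['element'] = item['atoms']['element'].str.upper()
    return item

dataset = da.load_dataset(PATH_TO_XYZ_DIR, 'xyz', transform=uppercase_elements)
da.make_lmdb_dataset(dataset, PATH_TO_LMDB_OUTPUT)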

Graphs are not bidirectional

Hi,

prot_df_to_graph and mol_df_to_graph result in graphs with connections that go only one way, which is due to the output of scipy's query_pairs. A possible fix would be to use:

edges = torch.cat((edges, edges.flip(dims=(0,))), dim=1)
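If per-edge weights or features accompany the edge list, they would need to be duplicated in the same order (a sketch extending the fix above; edge_weights is a stand-in name for whatever tensor travels with edges):

import torch

# Append each edge's reverse so messages pass in both directions.
edges = torch.cat((edges, edges.flip(dims=(0,))), dim=1)
# Duplicate per-edge attributes to stay aligned with the doubled edge list.
edge_weights = torch.cat((edge_weights, edge_weights), dim=0)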

Loading PPI Dataset with PyTorch Geometric

Hello.

First off, I want to say that it is a sight for sore eyes seeing a repository as well documented as this. I am very impressed given the scope of it.

On another note, I did encounter an odd ModuleNotFoundError when trying to run the train_ppi.py script locally in the provided "geometric" Conda environment:

<frozen importlib._bootstrap>:665: in _load_unlocked
    ???
../../../../../anaconda3/envs/geometric/lib/python3.6/site-packages/_pytest/assertion/rewrite.py:170: in exec_module
    exec(co, module.__dict__)
train_ppi.py:11: in <module>
    import ppi_dataloader as dl
ppi_dataloader.py:15: in <module>
    import atom3d.torch.graph as gr
E   ModuleNotFoundError: No module named 'atom3d.torch'

It appears as though a directory (atom3d.torch) did not get added to version control and, as such, is not appearing remotely. Was this omission intentional?

Thank you for your time!

`res` dataset source appears to be broken

When I try to pull the res dataset I get:

>>> da.download_dataset(out_path="/path/to/data/atom3d-data", name="res", split="cath-topology")
--2022-02-05 12:50:33--  http://1rjeayyofn0y6pgnljyg0fy5fkqaopoqc/
Resolving 1rjeayyofn0y6pgnljyg0fy5fkqaopoqc (1rjeayyofn0y6pgnljyg0fy5fkqaopoqc)... failed: Name or service not known.
wget: unable to resolve host address ‘1rjeayyofn0y6pgnljyg0fy5fkqaopoqc’

gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now

This breaks with or without specifying a split. Pulling the link from the source file and checking it directly, it's also broken.

PSR dataset improvement

Hello, thank you for providing such a great dataset collection to the community. I was recently working on the PSR dataset and noticed some possible improvements, mainly around maintainability and additional information.

  1. It would be beneficial to keep the dataset up to date. For example, at the time of publication, CASP 11 stage 2 was selected as the test set. I think this could be updated to a more recent round. This would require some versioning at the dataset level to keep things consistent; versions could be named after the latest full CASP round.
  2. Related to the previous point: currently it is challenging to extend the dataset while keeping it consistent. Some guidelines would be helpful.
  3. I couldn't find a way to tell whether a sample is from stage 1 or stage 2. I'm not sure if I'm missing something, but there is no information available for making custom splits.

Again, thank you for sharing your great work.

How to evaluate and tune each model?

For the LBA problem, how do I tune each model (cnn3d, enn, and gnn) if I use my own data? How can I evaluate the models and make predictions? If you can provide examples, that would be a great help. Many thanks!

Severe contradiction in the LBA dataset at 60% identity

Hi, dear atom3d,

I successfully used the LBA dataset with the 30% identity split. However, there is a serious contradiction in the dataset with the 60% identity split. To be explicit, as written in the paper, the 60% identity split leads to training, validation, and test sets of sizes 3,678, 460, and 460, respectively. However, there are only 3,563 samples in the training set and 452 in the test set.

Can you please take a look at the splitting setting again and see whether there was a mistake?

Best,

example for LBA

Dear Authors,

I saw that you provide examples for SMP, which is super helpful. I'm studying LBA and wonder how the ENN models LBA, especially how it represents proteins. Will you release the example code soon? Thanks!

Training GNN on PDBBind

Hi,

I was really intrigued by this work when I heard the presentation at NeurIPS and wanted to try setting it up by running the pytorch-geometric GNN on PDBBind, but wasn't able to due to various issues. Could you please add some documentation on how to download the dataset, how to generate splits and how to train a model? Various functions seemed to have been moved or renamed (maybe some tests would help to ensure this doesn't happen?).

I think most of the pieces are there and this project has a lot of potential, so I'm looking forward to coming back to it.

Incorrect URL to Download RES Dataset

I am trying to download the RES dataset using atom3d, but the URL is incorrect. Following the documentation, I run the following code:

import atom3d.datasets.datasets as da
da.download_dataset('res', 'data')

However, I get the following error message:

--2021-08-03 15:59:02--  http://1nmsnqayokof9-76l4gyqvodsehnzlxv7/
Resolving 1nmsnqayokof9-76l4gyqvodsehnzlxv7 (1nmsnqayokof9-76l4gyqvodsehnzlxv7)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘1nmsnqayokof9-76l4gyqvodsehnzlxv7’

gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now

When I looked at the source code for the download_dataset() function, I noticed that all the other datasets have URLs of the form https://zenodo.org/record/.... However, the RES dataset tries downloading from 1nmsnqayokof9-76l4gyqvodsehnzlxv7/, which is not a proper URL. I believe this is the problem.

When you get a moment, can you please fix this and provide the correct URL?

Also, this is my first time opening an issue so if more information or context is needed, please let me know!

What is the version of rdkit?

Traceback (most recent call last):
  File "data_TL.py", line 4, in <module>
    from rdkit import Chem
  File "/home/panfulu/anaconda3/envs/atom3d/lib/python3.6/site-packages/rdkit/Chem/__init__.py", line 18, in <module>
    from rdkit import DataStructs
  File "/home/panfulu/anaconda3/envs/atom3d/lib/python3.6/site-packages/rdkit/DataStructs/__init__.py", line 13, in <module>
    from rdkit.DataStructs import cDataStructs
ImportError: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.11' not found (required by /home/panfulu/anaconda3/envs/atom3d/lib/python3.6/site-packages/rdkit/DataStructs/../../../../libRDKitDataStructs.so.1)

When I use rdkit 2020.09, I run into this error. Which version of rdkit should I use?

model.to(device)

Dear Authors,

I am using the LBA model and tried "model = model.to(device)", but I got this error:

device, dtype, non_blocking = torch._C._nn._parse_to(*args, **kwargs)
ValueError: too many values to unpack (expected 3)

Do you know how to change the device (e.g., CPU to GPU) for the CGModule model?

thanks a lot!

Multiple sequences for one protein

I'm using the recently uploaded LBA dataset in LMDB format. I found that in many examples there is more than one string in the 'seq' attribute, which I understand to be the amino acid sequence of the protein in the complex. Can you explain why there can be multiple sequences for one protein?

how do i get processed data for ENN training

Dear Authors,

I ran into problems preparing the input for ENN training on SMP, as shown in your example. Can you share the preprocessing script? I have read your fantastic documentation, but I still need your help with the preprocessing for ENN & SMP.

In addition, do you run "python train.py" directly in bash?

Much appreciated!

No module named 'data_qm9_for_ptgeom' when training on QM9

Following the README.md, I ran the command below and hit the following problem. The whole log is:

(env) [ pytorch_geometric]$ python train_qm9.py --target 7 --prefix qm9-u0
Traceback (most recent call last):
  File "train_qm9.py", line 10, in <module>
    from data_qm9_for_ptgeom import GraphQM9
ModuleNotFoundError: No module named 'data_qm9_for_ptgeom'

Questions about the LEP dataset

Hi, dear authors of atom3d, thanks for providing the data collection. I ran into a problem understanding the LEP dataset.

If my understanding is correct, each data point has an 'atoms_active' and an 'atoms_inactive'. These two correspond to two different protein-ligand pairs, with one positive label and one negative label. However, there is also another key named 'label'. It takes two sorts of values: A or I.

I guess A stands for active and I for inactive? But this seems contradictory, so what is this 'label' actually used for?

How to train the cnn model?

I ran:

python train.py \
    --data_dir data/split-by-sequence-identity-30/data/ \
    --mode train \
    --batch_size 16 \
    --num_epochs 50 \
    --learning_rate 1e-4

(very similar to the GNN and ENN training commands), but I got:

Traceback (most recent call last):
  File "train.py", line 173, in <module>
    parser.add_argument('--data_dir', type=str, default=os.environ['LBA_DATA'])
  File "/home/xzhang/miniconda3/envs/atom3d/lib/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'LBA_DATA'
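For reference, the traceback points at the cause: the argparse default os.environ['LBA_DATA'] is evaluated at the moment the argument is defined, so a missing LBA_DATA variable raises even though --data_dir was passed on the command line. A tolerant default avoids this (a minimal sketch of the relevant line; alternatively, export LBA_DATA before running):

import argparse
import os

parser = argparse.ArgumentParser()
# os.environ.get returns None instead of raising when LBA_DATA is unset,
# so the value passed via --data_dir still takes effect.
parser.add_argument('--data_dir', type=str,
                    default=os.environ.get('LBA_DATA'))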

Loading LBA dataset.

I'd like to load the LBA dataset and get the following:

  1. Amino acid sequences
  2. SMILES strings for the molecules
  3. Binding affinity values

I downloaded the LBA dataset and loaded it in python with:

# Load dataset from directory of PDB files
dataset = da.load_dataset("ligand_binding_affinity", 'pdb')

But how can I get the amino acid sequences, SMILES strings, and binding affinity from this?

Thanks for your time.
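For reference, the LMDB version of the dataset appears to expose these fields per item, judging by other issues on this page (a minimal sketch; 'seq' and item['scores']['neglog_aff'] are mentioned elsewhere here, while the 'smiles' key is an assumption to verify with item.keys()):

from atom3d.datasets import LMDBDataset

dataset = LMDBDataset(PATH_TO_LMDB)  # the LMDB split, not the raw PDB files
item = dataset[0]
print(item.keys())                   # confirm which fields are present

seqs = item['seq']                        # per-chain amino acid sequences
affinity = item['scores']['neglog_aff']   # -log(K) binding affinity
smiles = item.get('smiles')               # ligand SMILES (key assumed)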

Providing data access in standard formats

Hello! I was wondering whether you would consider providing your datasets in standard formats for ease of integration into existing frameworks, outside of your API + LMDB files. In particular, I'm looking to evaluate on the LEP dataset with code that operates on PDBbind-like structures (i.e. raw PDB file for protein, mol2/sdf for ligand). Let me know, and thank you!

Questions about targets of LBA in atom3d paper

Hi, as stated in the Atom3D paper, the target for the LBA dataset is -log(K). Do we need to calculate the negative log score ourselves, or is the value of item['scores']['neglog_aff'] already preprocessed?

Besides, I notice there are two versions of each protein: the original structure and the pocket. Can I use the pocket-ligand pair to predict the binding affinity, since the pocket contains far fewer atoms?

GNN PSR benchmark

When I trained the GNN model provided in examples/psr/gnn/model.py using the hyperparameters specified in examples/psr/gnn, I'm getting a per-target Spearman's rho of 0.503 +/- 0.013 on the validation set, about what is reported in the paper. However, I'm only getting a per-target Spearman's rho of 0.405 +/- 0.013 on the test set. For reference, I'm seeing 0.582 +/- 0.007 for the per-target Spearman on the training data after 50 epochs.

Question for protein sequence in LBA dataset

Hi there,
Thank you for the nice open-source datasets and useful util functions in Atom3D.
I have a question about how to get the protein sequences in the LBA dataset. After downloading the dataset from Zenodo, I found that the 'seq' key in each item is an empty list. I also tried get_chain_sequences but got an empty list as well. How can I get the amino acid sequences for the LBA dataset?

Can't execute create_hdf5.py

It seems we are trying to import from util instead of from atom3d.util:

Traceback (most recent call last):
  File "atom3d/datasets/lba/create_hdf5.py", line 11, in <module>
    from util import datatypes as dt
ImportError: cannot import name 'datatypes'

I changed the imports to:

import atom3d.util.formats as dt
import atom3d.util.file as fi

Thanks!
Allan

Training 3D CNN on residue deletion

Thanks for your really cool work and for sharing the repository - your NeurIPS paper's results looked very interesting. I am trying to retrain the 3dcnn model on the residue deletion task using the data splits that you have provided. After some changes to the code, I am stuck at the data loader. The data files you provide look like:

data/residue_deletion/split/train_envs_0000_1000.h5
data/residue_deletion/split/train_envs_0001_1000.h5

Whereas the dataloader in benchmarking/cnn3d/train_resdel.py currently expects the format in ResDel_Dataset_PT to be the following:
data = torch.load(os.path.join(self.path, f'data_t_{idx}.pt'))

Can you please provide information on how to convert the h5 files to the format expected by the data loader? I looked at atom3d/datasets/res/convert_resdel_from_hdf5_to_npz.py but it doesn't look like the right conversion?

Any pointers to get the data loading to work would greatly help.

Thanks,
Meghana.

lba bugs

When I use the ENN on LBA, it seems "cgprod_bounded" is not defined in argparse, and I get some errors. Thanks for your time!

Atom3D for a reaction dataset

I was wondering if it is possible to use the atom3d package to predict the properties of a reaction, where the input data would be the structures of the reactant and product molecules, and the output a property of the reaction itself.

Missing labels in res-del dataset

Thanks for the great work on this package!
I downloaded the LMDB dataset for residue deletion, which unzipped to the following folder:

/raw/RES/data/
    data.mdb  lock.mdb

When I look at the dataframes for each protein structure, the labels are missing.

dataset = da.load_dataset(lmdb_path, 'lmdb')
dataset.get('100d')

{'atoms':      ensemble  subunit structure  model chain hetero insertion_code  ...      x      y       z element  name  fullname  serial_number
 0    100d.pdb        0  100d.pdb      0     A                        ... -4.549  5.095   4.262       O   O5'       O5'              1
 1    100d.pdb        0  100d.pdb      0     A                        ... -4.176  6.323   3.646       C   C5'       C5'              2
 [408 rows x 20 columns],
 'id': '100d',
 'file_path': '/oak/stanford/groups/rbaltman/aderry/graph-pdb/data/raw/100d.pdb',
 'labels': Empty DataFrame
 Columns: [subunit, label, x, y, z]
 Index: [],
 'subunit_indices': [],
 'types': {'atoms': "<class 'pandas.core.frame.DataFrame'>", 'id': "<class 'str'>", 'file_path': "<class 'str'>", 'labels': "<class 'pandas.core.frame.DataFrame'>", 'subunit_indices': "<class 'list'>", 'types': "<class 'dict'>"}}

Is the idea that one downloads this slightly reformatted PDB data and then runs some feature generation code (e.g., generating voxels for the 3D CNN) on top of it? Can you please point to the code that does this (the current code in this repo still seems to use shards rather than the LMDB format)?

Thanks.

Error downloading LBA dataset

Hi, I downloaded and unzipped the dataset from https://zenodo.org/record/4914718#.YPFQNegzZPY. However, I found no LMDB file, even though I want to use:

from atom3d.datasets import LMDBDataset

dataset = LMDBDataset(PATH_TO_LMDB)
print(len(dataset))  # Print length
print(dataset[0])  # Print 1st entry
labels = [item['scores']['neglog_aff'] for item in dataset]  # Get all labels

Can you give me guidance on how to load the LBA dataset correctly?
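For reference, the download helper used in the RES issues above takes a dataset name and split, so the LMDB version can likely be fetched directly (a minimal sketch; the exact split string is an assumption to check against the documentation):

import atom3d.datasets.datasets as da

# Split identifier assumed; check the docs for the exact string.
da.download_dataset('lba', PATH_TO_DATA, split='sequence-identity-30')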
