deepgraphlearning / gearnet

GearNet and Geometric Pretraining Methods for Protein Structure Representation Learning, ICLR'2023 (https://arxiv.org/abs/2203.06125)

License: MIT License

Python 98.25% Dockerfile 1.75%

graph-neural-networks pre-training protein-representation-learning

gearnet's Issues

About Dataset

Hello, dear authors!
I have the following questions about the datasets used in the experiments:

For the Enzyme Commission dataset, I downloaded the data but only got the PDB files and the PDB indices of the training set.
How do I get the labels? I guess the suffix of the PDB index stands for the label? For example, in 2FOR-A, is A the label?
The same question applies to Gene Ontology (GO).
Why does the AlphaFold dataset have training, validation, and test sets?
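
For what it is worth, identifiers of the form 2FOR-A usually follow the "PDB ID, dash, chain ID" convention, so the suffix most likely names a chain rather than a label, and the labels normally live in a separate annotation file. Below is a minimal sketch of splitting such identifiers and joining them against an annotation table. The tab-separated layout (identifier, then comma-separated labels) and the rows themselves are assumptions made up for illustration; the files shipped with the dataset may look different.

```python
import csv
import io


def split_identifier(identifier):
    """Split '2FOR-A' into ('2FOR', 'A'); the suffix is read as a chain ID."""
    pdb_id, _, chain_id = identifier.partition("-")
    return pdb_id, chain_id


def read_annotations(handle):
    """Read a hypothetical TSV: identifier <TAB> comma-separated labels."""
    labels = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment/header lines
        labels[row[0]] = row[1].split(",") if len(row) > 1 and row[1] else []
    return labels


# Made-up rows for illustration only; real labels come from the dataset files.
example = io.StringIO("# id\tlabels\n2FOR-A\tLABEL_1,LABEL_2\n1ABC-B\tLABEL_3\n")
annotations = read_annotations(example)
print(split_identifier("2FOR-A"), annotations["2FOR-A"])
```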

By the way, I tried using TorchDrug, and the experience was slightly different from PyG.

Pre-training on different datasets

Hello, you discussed the results of pre-training on different datasets in the appendix.
As Table 8 shows, performance is comparable whether pre-training uses real PDB structures or AlphaFold predictions (v1 or v2), even though the PDB set has only about 300,000 structures and the AlphaFold set about 800,000.
Why do the authors use the larger AlphaFold set in the main text? Also, in theory a larger dataset should give better pre-training results, so why does Table 8 not show this?

About Fold and Reaction dataset in torchdrug

Hi, sorry to bother you about the TorchDrug API. I want to reproduce your results on Fold Classification and Reaction. However, I find that the Fold dataset in torchdrug/datasets/fold/Fold does not seem to contain protein structure data; it basically contains sequences and labels. As I understand it, GearNet is a structure-based method and its pretraining tasks are also structure-based, so I am confused by this.
Besides, I can't find the Reaction dataset in TorchDrug. Could you please tell me which dataset you used, and share training configs like those for EC and GO?

Sorry again for the trouble, and thank you for open-sourcing such great work.

Training details of basic GearNet on Fold prediction task

Hi, could you provide the training details of the basic GearNet (without IEConv) on the Fold prediction task?
The results reported on page 8 of your paper are:
28.4 42.6 95.3 for test_fold, test_superfamily, test_family.

Is this the right config: https://github.com/DeepGraphLearning/GearNet/blob/main/config/downstream/Fold3D/gearnet.yaml? I couldn't find the basic GearNet model architecture in this repo.

Thanks!

Asking about the default config option `save_interval: 5` when pretraining on AlphaDB

Hi there, I noticed that the default config option save_interval: 5 (GearNet/config/pretrain/mc_esm_gearnet.yaml, line 66 in 1a1d15d), when read by the script pretrain.py, makes the model train on one pickled part of AlphaDB (consisting of 220k proteins) for 5 epochs, then on another pickled part for 5 epochs, and so on. (This option also controls the interval at which the model is saved, although the two could be adjusted independently.)
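
The schedule described above can be sketched as a small generator. This is only a reading of the behaviour reported in this issue, not the actual code of pretrain.py; the function name part_schedule, the values num_epoch=20 and num_parts=3, and the wrap-around back to the first part are all assumptions for illustration.

```python
def part_schedule(num_epoch, save_interval, num_parts):
    """Yield (first_epoch, last_epoch_exclusive, part_index) per training chunk."""
    for chunk, start in enumerate(range(0, num_epoch, save_interval)):
        stop = min(start + save_interval, num_epoch)
        # Assumption: parts are visited in order and reused once exhausted.
        yield start, stop, chunk % num_parts


for start, stop, part in part_schedule(num_epoch=20, save_interval=5, num_parts=3):
    print(f"epochs {start}-{stop - 1}: train on part {part}, then save a checkpoint")
```

Under this reading, a smaller save_interval switches parts (and writes checkpoints) more often, while a larger one spends more consecutive epochs on each part.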

Could you provide a bit of insight into why it is set this way? Is it necessary, for some practical reason, to train for enough epochs on one pickle before moving to the next one? Thank you!

how to use the GearNet to extract the protein feature?

Hi authors, great work! I want to use GearNet as a feature extractor to extract protein features. How can I do that?

thanks!!!

Request for guidance on preprocessing PDB files for model input

Hello,
I have come across your fascinating GitHub repository on protein structure pre-training, and I am excited to explore its potential for my own research. I noticed that the provided data is in HDF5 format, and there is no preprocessing code available for PDB files. I would like to use my own PDB files for inference with your model, but I am unsure how to preprocess them to match the expected input format.

Would you be able to provide some guidance or share a sample preprocessing script for converting PDB files to the required HDF5 format? This would greatly help me and other researchers who are interested in utilizing your work for various applications.

Thank you for your time and for sharing your valuable work with the community. I am looking forward to your response and any assistance you can provide.

The pre-trained GearNet-Edge model for Fold Classification

Thank you for your amazing work! I found that for the Fold Classification task, the GearNet-Edge model is implemented with the GearNetIEConv script rather than the GearNet script, and the two differ in some details (e.g., an extra input embedding and IEConv layers). Could you therefore provide a GearNet-Edge model pretrained with multiview contrastive learning using the GearNetIEConv script for Fold Classification (rather than the one based on the GearNet script for the EC task)? Thank you.

none

Solution when alpha carbon coordinate is missing

Hi, since GearNet models each residue in the protein by the C_alpha coordinates, how do you handle the situation when the coordinate for alpha carbon is missing?
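
Before deciding how to handle such residues, it can help to find out which ones are affected. Below is a minimal sketch that scans PDB ATOM records and lists residues without a CA atom, assuming the standard fixed-column PDB layout. The helper names atom_line and residues_missing_ca and the coordinates are made up for the example, and this says nothing about what GearNet or TorchDrug do internally in this case.

```python
def atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a fixed-column PDB ATOM record (coordinates are made up)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def residues_missing_ca(pdb_lines):
    """Return (chain, number, insertion code, name) for residues with no CA atom."""
    seen = {}  # residue key -> True once a CA atom has been found
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        key = (line[21], int(line[22:26]), line[26].strip(), line[17:20].strip())
        has_ca = line[12:16].strip() == "CA"
        seen[key] = seen.get(key, False) or has_ca
    return [key for key, found in seen.items() if not found]


lines = [
    atom_line(1, "N", "ALA", "A", 1, 0.0, 0.0, 0.0),
    atom_line(2, "CA", "ALA", "A", 1, 1.5, 0.0, 0.0),
    atom_line(3, "N", "GLY", "A", 2, 3.0, 1.0, 0.0),   # GLY 2 has no CA record
    atom_line(4, "C", "GLY", "A", 2, 4.0, 2.0, 0.0),
]
print(residues_missing_ca(lines))
```

Affected residues could then be dropped or their position approximated from neighbouring backbone atoms; which choice is appropriate depends on the downstream task.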

shape mismatch

I encountered a shape mismatch issue during runtime.

File "/home/admin/anaconda3/envs/test_env/lib/python3.7/site-packages/torchdrug-0.2.0-py3.7.egg/torchdrug/layers/conv.py", line 813, in message_and_aggregate
    return update.view(graph.num_node, self.num_relation * self.input_dim)
RuntimeError: shape '[975, 472]' is invalid for input of size 312000

Protein structure:

print(protein, protein.node_feature.shape)  # PackedProtein(batch_size=1, num_atoms=[51], num_bonds=[975], num_residues=[51])   torch.Size([51, 21])
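
A quick arithmetic check on the numbers in the error message shows how far apart the requested and actual shapes are. The variable names below are only labels for the values printed above; nothing here calls TorchDrug.

```python
num_rows = 975        # first dimension requested by view(); equals num_bonds above
expected_cols = 472   # num_relation * input_dim requested by view()
actual_size = 312000  # number of elements actually present in the tensor

needed = num_rows * expected_cols
cols, remainder = divmod(actual_size, num_rows)
print(needed)           # elements the requested view would need
print(cols, remainder)  # columns actually available per row
```

The tensor therefore holds 320 values per row instead of the 472 the layer expects. Both numbers happen to be multiples of 8 (8 × 59 and 8 × 40), so one thing worth checking, purely as a guess from the arithmetic, is whether the edge feature dimension given to the model matches what the graph construction actually produces.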

Low performance on training from scratch on a single GPU

Hi, I am trying to reproduce the experiments, but my results differ substantially from those reported in the paper.
Reproduced:
GearNet:
EC: 0.514 (200 epochs)
GO-BP: 0.176 (146 epochs)
GO-CC: 0.145 (84 epochs)
GearNet-Edge:
EC: 0.404 (163 epochs)
GO-BP: 0.255 (100 epochs)
GO-CC: 0.163 (107 epochs)

I use the same configuration and hyperparameters as provided in the repo. Training runs on a single GPU, and some of the experiments are still running.

Many thanks

Node classification tasks

Hi! First of all great job! I have been trying to do node classification in residue view, using my own node labels. However, I haven't been able to configure the NodePropertyPrediction task to use those labels instead of predicting the residue features. Do you have any guidance on how I can proceed to do this? Any help is appreciated

How can I load pretrained weights from checkpoint to go on pretraining?

Hello! Thanks for your great work!
For some reason, I couldn't run the whole training loop in your pretraining scripts, but I did get some checkpoints such as "model_epoch_25.pth". How can I load such a checkpoint and continue pretraining from it?
Looking forward to your reply!
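
Resuming generally involves three steps: finding the most recent checkpoint, loading it into the training engine before calling train again, and training only for the remaining epochs. Below is a minimal sketch of the first and third steps using only the standard library. The file list and the total of 50 epochs are hypothetical. The loading step itself is left out; TorchDrug's core.Engine has a load method for checkpoints, but whether the pretraining script exposes it through an option is not stated here.

```python
import re

CHECKPOINT_PATTERN = re.compile(r"model_epoch_(\d+)\.pth$")


def latest_checkpoint(file_names):
    """Return (epoch, file_name) for the highest-numbered checkpoint, or None."""
    found = []
    for name in file_names:
        match = CHECKPOINT_PATTERN.search(name)
        if match:
            found.append((int(match.group(1)), name))
    return max(found) if found else None


# Hypothetical directory listing; in practice use os.listdir(output_dir).
names = ["model_epoch_5.pth", "model_epoch_25.pth", "model_epoch_10.pth", "log.txt"]
epoch, path = latest_checkpoint(names)
total_epochs = 50  # hypothetical target taken from the config
print(path, "-> epochs left:", total_epochs - epoch)
```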

confusion on epochs

Hi!
I have a question about the number of epochs used in the experiments. The paper states that EC is trained for 200 epochs, but the config sets the number of epochs to 50. Should I change it to 200 to reproduce the experiment?

Thanks for your help!

ValueError: Unknown value `CHI_SQUAREPLANAR`. Available vocabulary is `range(0, 4)`

15:52:05   Config file: ./config/downstream/GO-BP/gearnet_yy.yaml
15:52:05   {'dataset': {'branch': 'BP',
             'class': 'GeneOntology',
             'path': '/scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/',
             'test_cutoff': 0.95,
             'transform': {'class': 'ProteinView', 'view': 'residue'}},
 'engine': {'batch_size': 2, 'gpus': [0], 'log_interval': 1000},
 'metric': 'f1_max',
 'optimizer': {'class': 'AdamW', 'lr': 0.0001, 'weight_decay': 0},
 'output_dir': '/scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein_output/downstream/GO-BP',
 'task': {'class': 'MultipleBinaryClassification',
          'criterion': 'bce',
          'graph_construction_model': {'class': 'GraphConstruction',
                                       'edge_feature': 'gearnet',
                                       'edge_layers': [{'class': 'SequentialEdge',
                                                        'max_distance': 2},
                                                       {'class': 'SpatialEdge',
                                                        'min_distance': 5,
                                                        'radius': 10.0},
                                                       {'class': 'KNNEdge',
                                                        'k': 10,
                                                        'min_distance': 5}],
                                       'node_layers': [{'class': 'AlphaCarbonNode'}]},
          'metric': ['auprc@micro', 'f1_max'],
          'model': {'batch_norm': True,
                    'class': 'GearNet',
                    'concat_hidden': True,
                    'hidden_dims': [512, 512, 512, 512, 512, 512],
                    'input_dim': 21,
                    'num_relation': 7,
                    'readout': 'sum',
                    'short_cut': True},
          'num_mlp_layer': 3},
 'train': {'num_epoch': 200}}
15:52:05   Downloading https://zenodo.org/record/6622158/files/GeneOntology.zip to /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology.zip
15:53:38   Extracting /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology.zip to /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO
15:53:41   Extracting /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology/train.zip to /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology
15:56:21   Extracting /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology/valid.zip to /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology
15:56:37   Extracting /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology/test.zip to /scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/protein-datasets/downstream/GO/GeneOntology

Constructing proteins from pdbs:   0%|          | 0/36635 [00:00<?, ?it/s]/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/protein.py:213: UserWarning: Unknown residue `HOH`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py:42: UserWarning: Unknown value `HOH`
  warnings.warn("Unknown value `%s`" % x)
[15:56:55] Explicit valence for atom # 6 O, 3, is greater than permitted
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/protein.py:213: UserWarning: Unknown residue `BIS`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py:42: UserWarning: Unknown value `BIS`
  warnings.warn("Unknown value `%s`" % x)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/protein.py:213: UserWarning: Unknown residue `EPE`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py:42: UserWarning: Unknown value `EPE`
  warnings.warn("Unknown value `%s`" % x)

Constructing proteins from pdbs:   0%|          | 3/36635 [00:00<54:08, 11.28it/s]/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/protein.py:213: UserWarning: Unknown residue `SO4`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py:42: UserWarning: Unknown value `SO4`
  warnings.warn("Unknown value `%s`" % x)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/protein.py:213: UserWarning: Unknown residue `PO4`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py:42: UserWarning: Unknown value `PO4`
  warnings.warn("Unknown value `%s`" % x)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/protein.py:213: UserWarning: Unknown residue `BME`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py:42: UserWarning: Unknown value `BME`
  warnings.warn("Unknown value `%s`" % x)

Constructing proteins from pdbs:   0%|          | 5/36635 [00:00<1:06:38,  9.16it/s]/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py:42: UserWarning: Unknown value `Fe`
  warnings.warn("Unknown value `%s`" % x)

Constructing proteins from pdbs:   0%|          | 5/36635 [00:00<1:10:20,  8.68it/s]
Traceback (most recent call last):
  File "/scratch/user/yuning.you/project/protein_cross_modal_pretraining/ProteinRepresentation/GearNet/script/downstream.py", line 56, in <module>
    dataset = core.Configurable.load_config_dict(cfg.dataset)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/core/core.py", line 269, in load_config_dict
    return cls(**new_config)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/core/core.py", line 288, in wrapper
    return init(self, *args, **kwargs)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/datasets/gene_ontology.py", line 72, in __init__
    self.load_pdbs(pdb_files, verbose=verbose, **kwargs)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/dataset.py", line 750, in load_pdbs
    protein = data.Protein.from_molecule(mol, **kwargs)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/utils/decorator.py", line 192, in wrapper
    return obj(*args, **kwargs)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/protein.py", line 185, in from_molecule
    protein = Molecule.from_molecule(mol, atom_feature=atom_feature, bond_feature=bond_feature,
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/utils/decorator.py", line 192, in wrapper
    return obj(*args, **kwargs)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/molecule.py", line 189, in from_molecule
    feature += func(atom)
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py", line 77, in atom_default
    onehot(atom.GetChiralTag(), chiral_tag_vocab) + \
  File "/scratch/user/yuning.you/.conda/envs/protein/lib/python3.9/site-packages/torchdrug/data/feature.py", line 47, in onehot
    raise ValueError("Unknown value `%s`. Available vocabulary is `%s`" % (x, vocab))
ValueError: Unknown value `CHI_SQUAREPLANAR`. Available vocabulary is `range(0, 4)`

Dear developers,

Thanks for your great work. When I try a quick fine-tuning run via python script/downstream.py -c ./config/downstream/EC/gearnet.yaml --gpus [0], the error messages above appear before model training starts (for both EC and GO-BP). I would appreciate your help in resolving this.
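
The traceback shows a one-hot encoder rejecting a chiral tag that is outside its fixed vocabulary range(0, 4); newer RDKit releases define additional chiral tags such as CHI_SQUAREPLANAR, which would explain the mismatch. Below is a standalone sketch of a one-hot encoder that routes out-of-vocabulary values to an extra slot instead of raising. The function onehot_with_unknown is an illustration, not TorchDrug's own function, and using it would change the feature length by one, so the model input dimension would have to be adjusted accordingly.

```python
import warnings


def onehot_with_unknown(value, vocab):
    """One-hot encode `value`; out-of-vocabulary values go to an extra last slot."""
    feature = [0] * (len(vocab) + 1)
    if value in vocab:
        feature[list(vocab).index(value)] = 1
    else:
        warnings.warn("Unknown value `%s`; using the unknown slot" % value)
        feature[-1] = 1
    return feature


chiral_tag_vocab = range(0, 4)  # same vocabulary as in the error message
print(onehot_with_unknown(2, chiral_tag_vocab))
print(onehot_with_unknown("CHI_SQUAREPLANAR", chiral_tag_vocab))
```

An alternative that leaves the feature size untouched is to install an RDKit version whose chiral tags all fall inside the expected vocabulary; which version that is should be checked against the TorchDrug requirements rather than taken from this note.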

RuntimeError: CUDA error: the provided PTX was compiled with an unsupported toolchain.

When I want to load the weight like this,

net = torch.load(pthfile)

I got this error:

RuntimeError: CUDA error: the provided PTX was compiled with an unsupported toolchain.

I don't know what happened. After searching online, it seems that I need to update my CUDA driver. Are there any other options?

A dataset not found when I run "python script/pretrain.py -c config/pretrain/mc_gearnet_edge.yaml --gpus [0]"

It seems that the file at "https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/UP000006548_3702_ARATH_v2.tar" no longer exists. When I entered this URL in my browser, it also reported that the file does not exist.

14:43:55   Downloading https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/UP000006548_3702_ARATH_v2.tar to /home/horace/scratch/protein-datasets/alphafold/UP000006548_3702_ARATH_v2.tar
Traceback (most recent call last):
  File "script/pretrain.py", line 50, in <module>
    dataset = core.Configurable.load_config_dict(cfg.dataset)
  File "/home/horace/.conda/envs/drug/lib/python3.7/site-packages/torchdrug/core/core.py", line 269, in load_config_dict
    return cls(**new_config)
  File "/home/horace/.conda/envs/drug/lib/python3.7/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/horace/.conda/envs/drug/lib/python3.7/site-packages/torchdrug/core/core.py", line 288, in wrapper
    return init(self, *args, **kwargs)
  File "/home/horace/.conda/envs/drug/lib/python3.7/site-packages/torchdrug/datasets/alphafolddb.py", line 122, in __init__
    tar_file = utils.download(self.urls[species_id], path, md5=self.md5s[species_id])
  File "/home/horace/.conda/envs/drug/lib/python3.7/site-packages/torchdrug/utils/file.py", line 31, in download
    urlretrieve(url, save_file)
  File "/home/horace/.conda/envs/drug/lib/python3.7/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/home/horace/.conda/envs/drug/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/home/horace/.conda/envs/drug/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/home/horace/.conda/envs/drug/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/home/horace/.conda/envs/drug/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/home/horace/.conda/envs/drug/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/home/horace/.conda/envs/drug/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
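
A likely explanation is that the latest/ directory only holds the current database release, so a hard-coded _v2 file name stops resolving once the archives are re-released under a newer version suffix. Below is a small sketch of rewriting the suffix. The helper with_version is hypothetical and the version number 4 is an assumption; the real value should be read from the FTP listing. Since the traceback shows the download being checked against an md5, the stored checksum would presumably need updating as well.

```python
import re


def with_version(url, version):
    """Replace the trailing `_v<N>.tar` suffix of an AlphaFold DB archive URL."""
    new_url, count = re.subn(r"_v\d+\.tar$", f"_v{version}.tar", url)
    if count == 0:
        raise ValueError("URL does not end in _v<N>.tar: %s" % url)
    return new_url


old = "https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/UP000006548_3702_ARATH_v2.tar"
# The version number below is an assumption; check the FTP listing for the real one.
print(with_version(old, 4))
```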

Pre-trained weight for ESM-GearNet

Howdy, thank you for sharing the amazing work, and may I ask if you have any plans on releasing the pre-trained weights for ESM-GearNet model? Many thanks!

input data

Dear authors, thanks for your great works.
I have some questions about the data. I tried to visualize the input protein sequence by calling to_sequence() on each batch. Here is the figure showing the sequence.

I wonder why there are so many .G entries, since . is a separator between multiple sequences (DeepGraphLearning/torchdrug#151). Also, after deleting all the . characters, the length of the remaining sequence is the same as the number of nodes in the graph. Could you help explain why there are so many .G entries? Many thanks!
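
One plausible reading, given the "Unknown residue ... Treat as glycine" warnings seen elsewhere when these datasets are loaded, is that each isolated non-protein group (water, ions, ligands) becomes its own one-residue segment printed as G. That is a guess, not a confirmed explanation. The sketch below only shows how to quantify the pattern in such a string; the example sequence is made up.

```python
def describe(sequence):
    """Return (residue count, segment count, number of lone-'G' segments)."""
    segments = [segment for segment in sequence.split(".") if segment]
    residues = sequence.replace(".", "")
    lone_g = sum(1 for segment in segments if segment == "G")
    return len(residues), len(segments), lone_g


# Made-up string in the same style as the output described above.
example = "MKVLA.G.G.G"
print(describe(example))
```

If the lone-G count roughly matches the number of heteroatom groups in the original PDB file, that would support this reading.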

TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases

Thank you for your great work. However, when I tried to pretrain, I encountered the following error.
python script/downstream.py -c config/downstream/EC/gearnet.yaml --gpus null
Traceback (most recent call last):
  File "script/downstream.py", line 11, in <module>
    from torchdrug import core, models, tasks, datasets, utils
  File "/home/hongyan/envs/pyg/lib/python3.7/site-packages/torchdrug/models/__init__.py", line 10, in <module>
    from .esm import EvolutionaryScaleModeling
  File "/home/hongyan/envs/pyg/lib/python3.7/site-packages/torchdrug/models/esm.py", line 6, in <module>
    import esm
  File "/home/hongyan/envs/pyg/lib/python3.7/site-packages/esm/__init__.py", line 8, in <module>
    from .data import Alphabet, RobertaAlphabet, BatchConverter, FastaBatchedDataset # noqa
  File "/home/hongyan/envs/pyg/lib/python3.7/site-packages/esm/data.py", line 11, in <module>
    from torchvision.datasets.utils import download_url
  File "/home/hongyan/envs/pyg/lib/python3.7/site-packages/torchvision/__init__.py", line 5, in <module>
    from torchvision import datasets
  File "/home/hongyan/envs/pyg/lib/python3.7/site-packages/torchvision/datasets/__init__.py", line 1, in <module>
    from ._optical_flow import KittiFlow, Sintel, FlyingChairs, FlyingThings3D, HD1K
  File "/home/hongyan/envs/pyg/lib/python3.7/site-packages/torchvision/datasets/_optical_flow.py", line 26, in <module>
    class FlowDataset(ABC, VisionDataset):
TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases
Here is my environment:
torch 1.11.0
torchdrug 0.2.0
pyg 2.0.4
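
For context, this TypeError is raised by Python itself whenever a class inherits from bases whose metaclasses are unrelated. The failing line is class FlowDataset(ABC, VisionDataset), so in this environment VisionDataset apparently carries a metaclass that is not compatible with ABCMeta, which may indicate that the installed torch and torchvision versions do not match. The sketch below reproduces the mechanism with made-up classes (CustomMeta, Base, CombinedMeta are hypothetical) and does not diagnose the actual packages.

```python
from abc import ABC, ABCMeta


class CustomMeta(type):
    """A metaclass unrelated to ABCMeta (stands in for a third-party one)."""


class Base(metaclass=CustomMeta):
    pass


try:
    # ABC uses ABCMeta, Base uses CustomMeta; neither derives from the other.
    class Broken(ABC, Base):
        pass
    message = None
except TypeError as error:
    message = str(error)

print(message)


class CombinedMeta(ABCMeta, CustomMeta):
    """A metaclass deriving from both resolves the conflict."""


class Fixed(ABC, Base, metaclass=CombinedMeta):
    pass
```

In an installed-package situation the practical fix is usually to align package versions rather than to patch classes like this.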

Dealing with proteins with multiple chains

For proteins with multiple chains, did you split them by chain and input the splits into the model one by one, or directly input the whole proteins?

In the section "F ADDITIONAL EXPERIMENTAL RESULTS ON EC AND GO PREDICTION - Pretraining on different datasets" of your paper, you wrote:

Specifically, we extract 123,505
experimentally-determined protein structures from PDB whose resolutions are between 0.0 and 2.5
angstroms, and we further extract 305,265 chains from these proteins to construct the final dataset

which seems to imply that you trained the model on a collection of single protein chains. However, you also ran experiments on Enzyme Commission number prediction. To my knowledge, many enzymes contain more than one chain. Splitting such an enzyme into separate chains and feeding them into the model one at a time seems problematic, since a single chain would hardly be enough to predict the enzyme type correctly.
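
To make the distinction concrete, the sketch below groups PDB ATOM records by chain ID, which is what per-chain processing amounts to at the file level. It assumes the standard fixed-column PDB layout (chain ID in column 22); the helper split_by_chain and the three example records are made up, and the sketch does not show how the paper's datasets were actually built.

```python
from collections import defaultdict


def split_by_chain(pdb_lines):
    """Group ATOM records by chain ID (column 22 of the PDB format)."""
    chains = defaultdict(list)
    for line in pdb_lines:
        if line.startswith("ATOM"):
            chains[line[21]].append(line)
    return dict(chains)


# Three made-up CA records: two in chain A, one in chain B.
lines = [f"ATOM  {i:5d}  CA  GLY {chain}{i:4d}" for i, chain in enumerate("AAB", start=1)]
per_chain = split_by_chain(lines)
print({chain: len(records) for chain, records in per_chain.items()})
```

Feeding the whole protein instead would simply mean skipping this grouping and keeping all ATOM records together.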

atom view

I was wondering how atom view is implemented? I'm getting a shape mismatch.

In mc_gearnet_edge.yaml I changed the view and entity level to 'atom' and the input dimension to 38, since I found 38 atom types in the TorchDrug protein class.
Is there another setting i need to change?

Thanks for creating GearNet!

Error in Fold3D config file

Hi! Thank you for your great work!! I just have a quick question. I am trying to use GearNet with the Fold3D dataset using the configuration file you provided, but I keep getting this error (screenshot attached). If I remove mlp_batch_norm and mlp_dropout, the code runs, but the model doesn't seem to train properly. I would really appreciate your input on this, or a pointer to what I am doing incorrectly.
Thanks a lot!!

UserWarning: Unknown value

Hi,

Thanks for your wonderful work.

When running

# Run GearNet on the Enzyme Commission dataset with 1 gpu
python script/downstream.py -c config/downstream/EC/gearnet.yaml --gpus [0]

I met the following log:

/home/chenshoufa/workspace/torchdrug/torchdrug/data/protein.py:213: UserWarning: Unknown residue `PT`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/home/chenshoufa/workspace/torchdrug/torchdrug/data/feature.py:42: UserWarning: Unknown value ` PT`
  warnings.warn("Unknown value `%s`" % x)
Constructing proteins from pdbs:   1%|█▏        | 172/19198 [00:29<1:01:27,  5.16it/s]
/home/chenshoufa/workspace/torchdrug/torchdrug/data/protein.py:213: UserWarning: Unknown residue `COB`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/home/chenshoufa/workspace/torchdrug/torchdrug/data/feature.py:42: UserWarning: Unknown value `COB`
  warnings.warn("Unknown value `%s`" % x)
Constructing proteins from pdbs:   1%|█▏        | 183/19198 [00:30<57:46,  5.48it/s]
/home/chenshoufa/workspace/torchdrug/torchdrug/data/feature.py:42: UserWarning: Unknown value `Be`
  warnings.warn("Unknown value `%s`" % x)
/home/chenshoufa/workspace/torchdrug/torchdrug/data/protein.py:213: UserWarning: Unknown residue `ADP`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/home/chenshoufa/workspace/torchdrug/torchdrug/data/feature.py:42: UserWarning: Unknown value `ADP`
  warnings.warn("Unknown value `%s`" % x)
/home/chenshoufa/workspace/torchdrug/torchdrug/data/protein.py:213: UserWarning: Unknown residue `BEF`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/home/chenshoufa/workspace/torchdrug/torchdrug/data/feature.py:42: UserWarning: Unknown value `BEF`
  warnings.warn("Unknown value `%s`" % x)
Constructing proteins from pdbs:   1%|█▏        | 186/19198 [00:31<1:01:54,  5.12it/s]
/home/chenshoufa/workspace/torchdrug/torchdrug/data/protein.py:213: UserWarning: Unknown residue `1NB`. Treat as glycine
  warnings.warn("Unknown residue `%s`. Treat as glycine" % type)
/home/chenshoufa/workspace/torchdrug/torchdrug/data/feature.py:42: UserWarning: Unknown value `1NB`

Is it normal?

Thanks in advance.
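
The names in these warnings (ADP, BEF, COB, and elsewhere HOH, SO4, PO4) are ligands, ions, and solvent rather than amino acids, so warnings of this kind are to be expected whenever the PDB files contain such groups. Whether mapping them to glycine affects training is a question for the maintainers. If the goal is only to keep the log readable, the standard warnings module can hide these two message types. The sketch below uses a stand-in loader (fake_loader is hypothetical) instead of the real dataset constructor.

```python
import warnings


def load_quietly(loader):
    """Run `loader()` while hiding only the 'Unknown residue/value' warnings."""
    with warnings.catch_warnings():
        # `message` is a regex matched against the start of the warning text.
        warnings.filterwarnings("ignore", message="Unknown residue", category=UserWarning)
        warnings.filterwarnings("ignore", message="Unknown value", category=UserWarning)
        return loader()


def fake_loader():
    # Stand-in for dataset construction; emits the same kind of warning.
    warnings.warn("Unknown residue `HOH`. Treat as glycine")
    return "loaded"


print(load_quietly(fake_loader))
```

Other warnings still pass through, and the filters are restored once the with block exits.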

Explainability

Hello,

Is there any code available for the explainability experiment in Section K in the appendix of your paper https://arxiv.org/pdf/2203.06125.pdf?

Thank you.

secondary structure evaluation

Hi, thank you for your amazing work!

I am trying to evaluate GearNet on secondary structure dataset, but it gives me this error:

AttributeError: 'PackedProtein' object has no attribute 'node_position'

I think this is because the secondary structure dataset doesn't provide node_position, which GearNet needs.

Is there any other way I can evaluate secondary structure on gearnet?

Thank you.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 20) of binary....

When I run
python -m torch.distributed.launch --nproc_per_node=4 script/downstream.py -c config/downstream/GO-BP/gearnet_edge.yaml --gpus [0,1,2,3] --ckpt
on worker*1 Tesla-V100-SXM2-32GB:4 GPU, 47 CPU, I got the error:

[219013] [E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
[219014] [E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out.
[219015] [E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805985 milliseconds before timing out.
[219016] Traceback (most recent call last):
[219017] File "/hubozhen/GearNet/script/downstream.py", line 75, in <module>
[219018] train_and_validate(cfg, solver, scheduler)
[219019] File "/hubozhen/GearNet/script/downstream.py", line 30, in train_and_validate
[219020] solver.train(**kwargs)
[219021] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/core/engine.py", line 155, in train
[219022] loss, metric = model(batch)
[219023] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
[219024] return forward_call(*input, **kwargs)
[219025] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
[219026] output = self.module(*inputs[0], **kwargs[0])
[219027] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
[219028] return forward_call(*input, **kwargs)
[219029] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/tasks/property_prediction.py", line 279, in forward
[219030] pred = self.predict(batch, all_loss, metric)
[219031] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/tasks/property_prediction.py", line 300, in predict
[219032] output = self.model(graph, graph.node_feature.float(), all_loss=all_loss, metric=metric)
[219033] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
[219034] return forward_call(*input, **kwargs)
[219035] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/models/gearnet.py", line 99, in forward
[219036] edge_hidden = self.edge_layers[i](line_graph, edge_input)
[219037] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
[219038] return forward_call(*input, **kwargs)
[219039] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/conv.py", line 92, in forward
[219040] output = self.combine(input, update)
[219041] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/conv.py", line 438, in combine
[219042] output = self.batch_norm(output)
[219043] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
[219044] return forward_call(*input, **kwargs)
[219045] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 758, in forward
[219046] world_size,
[219047] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/_functions.py", line 42, in forward
[219048] dist._all_gather_base(combined_flat, combined, process_group, async_op=False)
[219049] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2070, in _all_gather_base
[219050] work = group._allgather_base(output_tensor, input_tensor)
[219051] RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
[219052] /opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/functional/functional.py:474: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
[219053] index1 = local_index // local_inner_size + offset1
[219054] /opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/functional/functional.py:474: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
[219055] index1 = local_index // local_inner_size + offset1
[219056] [E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[219057] terminate called after throwing an instance of 'std::runtime_error'
[219058] what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
[219059] [E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[219060] terminate called after throwing an instance of 'std::runtime_error'
[219061] what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out.
[219062] /opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/data/graph.py:1667: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
[219063] edge_in_index = local_index // local_inner_size + edge_in_offset
[219064] [E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[219065] terminate called after throwing an instance of 'std::runtime_error'
[219066] what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805985 milliseconds before timing out.
[219067] WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 21 closing signal SIGTERM
[219068] ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 20) of binary: /opt/anaconda3/envs/manifold/bin/python
[219069] Traceback (most recent call last):
[219070] File "/opt/anaconda3/envs/manifold/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[219071] "__main__", mod_spec)
[219072] File "/opt/anaconda3/envs/manifold/lib/python3.7/runpy.py", line 85, in _run_code
[219073] exec(code, run_globals)
[219074] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
[219075] main()
[219076] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
[219077] launch(args)
[219078] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
[219079] run(args)
[219080] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
[219081] )(*cmd_args)
[219082] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
[219083] return launch_agent(self._config, self._entrypoint, list(args))
[219084] File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
[219085] failures=result.failures,
[219086] torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
[219087] ===================================================
[219088] /hubozhen/GearNet/script/downstream.py FAILED
[219089] ---------------------------------------------------
[219090] Failures:
[219091] [1]:
[219092] time : 2022-12-12_09:41:02
[219093] host : pytorch-7c3c96f1-d9hcm
[219094] rank : 2 (local_rank: 2)
[219095] exitcode : -6 (pid: 22)
[219096] error_file: <N/A>
[219097] traceback : Signal 6 (SIGABRT) received by PID 22
[219098] [2]:
[219099] time : 2022-12-12_09:41:02
[219100] host : pytorch-7c3c96f1-d9hcm
[219101] rank : 3 (local_rank: 3)
[219102] exitcode : -6 (pid: 23)
[219103] error_file: <N/A>
[219104] traceback : Signal 6 (SIGABRT) received by PID 23
[219105] ---------------------------------------------------
[219106] Root Cause (first observed failure):
[219107] [0]:
[219108] time : 2022-12-12_09:41:02
[219109] host : pytorch-7c3c96f1-d9hcm
[219110] rank : 0 (local_rank: 0)
[219111] exitcode : -6 (pid: 20)
[219112] error_file: <N/A>
[219113] traceback : Signal 6 (SIGABRT) received by PID 20
[219114] ===================================================

Someone said this happens when loading large data. I find that the utilization of these four GPUs is 100%.
However, when I ran the same procedure on another V100 machine (worker*1:
Tesla-V100-SXM-32GB: 4 GPU, 48 CPU), it worked fine.
This confuses me.
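
A note on reading the log above: the watchdog lines name the rank, the collective operation, and the elapsed time for each timeout. Below is a minimal sketch, using only the Python standard library, for pulling those fields out of such lines. The function name `parse_watchdog_timeouts` is hypothetical; the three sample lines are copied from the log above.

```python
import re

# Matches the NCCL watchdog message format seen in the log above.
WATCHDOG = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(OpType=(?P<op>\w+), Timeout\(ms\)=(?P<timeout>\d+)\) "
    r"ran for (?P<elapsed>\d+) milliseconds"
)


def parse_watchdog_timeouts(log_text):
    """Return {rank: (op_type, timeout_ms, elapsed_ms)} for each timeout found."""
    result = {}
    for m in WATCHDOG.finditer(log_text):
        result[int(m["rank"])] = (m["op"], int(m["timeout"]), int(m["elapsed"]))
    return result


SAMPLE = """\
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out.
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805985 milliseconds before timing out.
"""

timeouts = parse_watchdog_timeouts(SAMPLE)
for rank in sorted(timeouts):
    op, timeout_ms, elapsed_ms = timeouts[rank]
    print(f"rank {rank}: {op} exceeded {timeout_ms // 60000} min (ran {elapsed_ms} ms)")
```

In the lines quoted here, rank 0 timed out in `_ALLGATHER_BASE` (reached from the batch norm layer in the traceback) while ranks 2 and 3 timed out in `BROADCAST`. One possible reading is that the ranks were no longer executing the same step when the hang began; the log alone does not confirm the cause.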

multi-gpu training fails

Hello,

Running

python -m torch.distributed.launch --nproc_per_node=4 script/downstream.py -c config/downstream/EC/gearnet.yaml --gpus [0,1,2,3]

does not succeed and produces the following log:

20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
Loading /home/chenshoufa/scratch/protein-datasets/EnzymeCommission/enzyme_commission.pkl.gz:  64%|██████████████████████████████████████████▉                        | 11854/18515 [08:49<20:55,  5.30it/s]Killing subprocess 1350247
Killing subprocess 1350248
Killing subprocess 1350249
Killing subprocess 1350250
Traceback (most recent call last):
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/chenshoufa/anaconda3/envs/gear/bin/python', '-u', 'script/downstream.py', '--local_rank=3', '-c', 'config/downstream/EC/gearnet.yaml', '--gpus', '[0,1,2,3]']' died with <Signals.SIGKILL: 9>.

Could you help me with this issue?
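
For context when comparing failure logs: the NCCL log earlier on this page ends with exit code -6, while this one ends with `SIGKILL: 9`. Below is a small sketch for decoding such codes with the standard `signal` module; `describe_exit` is a hypothetical helper. A SIGKILL while several processes each load the dataset is often, though not always, the operating system's out-of-memory killer, so host RAM may be worth checking. That is an assumption, not something this log confirms.

```python
import signal


def describe_exit(returncode):
    """Translate a subprocess return code into a readable description.

    On POSIX, a negative return code -N means the process was
    terminated by signal N.
    """
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by unknown signal {-returncode}"
    return f"exited with status {returncode}"


print(describe_exit(-6))  # exit code style reported by torch.distributed.elastic
print(describe_exit(-9))  # the signal number shown in the CalledProcessError above
```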

Request for the pretrained model and instructions on getting own proteins' embeddings.

Thanks for the wonderful work!

I am trying to use the learned embeddings for a downstream protein classification problem on my own datasets. Since training the model requires a good HPC, I have three questions:

1. Could you kindly upload your pretrained model?
2. Could you explain how to generate the training and testing datasets (the pkl.gz file) from our own PDB files?
3. Based on the pkl.gz file generated in the previous question, how do we apply the trained model to get the final embedding vectors (512 dimensions) for our own PDB files?

error in mc_esm_gearnet

Howdy, thank you for your awesome work on Enhancing Protein Language Models with Structure-based Encoder and Pre-training. I am running the pretraining experiment now, and I am facing the error "Can't find atom_feature in features.atom". I paste the error statement below.

It seems like it cannot recognize the atom_feature: null or bond_feature: null. Do I need to change the source code for implementing these two arguments?

Any help would be greatly appreciated!

An error occurring when using the StepLR scheduler on the FOLD3D dataset

Hi, thank you for your amazing work!
I tried to reproduce the GearNet results on the Fold3D dataset, following the original .yaml file, in which the StepLR scheduler is specified. However, the following error occurred when using the scheduler. What causes this? Thank you!

15:28:38 #train: 12312, #valid: 736, #test: 718
Traceback (most recent call last):
File "script/downstream.py", line 74, in <module>
solver, scheduler = util.build_downstream_solver(cfg, dataset)
File "/GearNet-new/util.py", line 121, in build_downstream_solver
scheduler = core.Configurable.load_config_dict(cfg.scheduler)
File "/torchdrug/lib/python3.8/site-packages/torchdrug/core/core.py", line 269, in load_config_dict
return cls(**new_config)
File "/torchdrug/lib/python3.8/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/torchdrug/lib/python3.8/site-packages/torchdrug/core/core.py", line 288, in wrapper
return init(self, *args, **kwargs)
File "/torchdrug/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 367, in __init__
super(StepLR, self).__init__(optimizer, last_epoch, verbose)
File "/torchdrug/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 367, in __init__
super(StepLR, self).__init__(optimizer, last_epoch, verbose)
File "/torchdrug/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 367, in __init__
super(StepLR, self).__init__(optimizer, last_epoch, verbose)
[Previous line repeated 991 more times]
RecursionError: maximum recursion depth exceeded while calling a Python object
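
The traceback shows the same `super(StepLR, self)` line calling itself until the recursion limit is hit. One mechanism that produces exactly this pattern, offered here as a hypothesis that the source does not confirm, is a class name being rebound to a wrapper subclass while the original class still uses two-argument `super` with that name. The sketch below reproduces the effect in plain Python; `Scheduler`, `StepLR`, `FixedStepLR`, and the `Wrapped` classes are toy stand-ins, not the PyTorch or TorchDrug classes.

```python
class Scheduler:
    def __init__(self):
        self.ready = True


class StepLR(Scheduler):
    def __init__(self):
        # Two-argument super looks up the global name StepLR at call time.
        super(StepLR, self).__init__()


_OriginalStepLR = StepLR


class Wrapped(_OriginalStepLR):
    """Stand-in for a wrapper subclass created by a framework."""


# Rebinding the name: StepLR now refers to the wrapper subclass, so
# super(StepLR, self) inside the original class resolves back to itself.
StepLR = Wrapped


def try_build():
    try:
        Wrapped()
    except RecursionError:
        return "RecursionError"
    return "ok"


class FixedStepLR(Scheduler):
    def __init__(self):
        # Zero-argument super is bound to the defining class, not to a global name.
        super().__init__()


_OriginalFixed = FixedStepLR


class WrappedFixed(_OriginalFixed):
    pass


FixedStepLR = WrappedFixed

print(try_build())
print(WrappedFixed().ready)
```

If this is what is happening, the installed PyTorch and TorchDrug versions are likely mismatched, and trying a different pairing of versions would be a reasonable first step.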

Error:General Union types are not currently supported. Only Union[T, NoneType] (i.e. Optional[T]) is supported.: File "/home/lvqy/anaconda3/envs/ZernikeMetric/lib/python3.8/site-packages/torch_cluster/rw.py", line 18

Hi, when I execute the command:

python script/pretrain.py -c config/pretrain/mc_gearnet_edge.yaml --gpus [0]

an error occurs. How can I solve this problem?

RuntimeError:
General Union types are not currently supported. Only Union[T, NoneType] (i.e. Optional[T]) is supported.:
File "/home/lvqy/anaconda3/envs/ZernikeMetric/lib/python3.8/site-packages/torch_cluster/rw.py", line 18
num_nodes: Optional[int] = None,
return_edge_indices: bool = False,
) -> Union[Tensor, Tuple[Tensor, Tensor]]:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
"""Samples random walks of length :obj:walk_length from all node indices
in :obj:start in the graph given by :obj:(row, col) as described in the

Config For Dataset Fold

Hello, could I get the configuration file for training on the Fold dataset?

Reproduce the result for ProteinBERT

Hi, thank you for sharing the work and answering questions. I have recently been trying to reproduce the ProteinBERT results shown in your paper. However, using the given config file directly, the performance is only about 0.079.

The loss is quite low on training, validation, and testing, but the model doesn't seem able to classify the data correctly. The downstream task is EC.
Do you have any suggestions for fixing this issue?

I also tried using HuggingFace protBERT to rerun the experiments. The result is also around 0.078, again with low loss values. Would you be willing to give any advice on this as well?

Many thanks for your answer!

Any plans on releasing the model's weights?

Hi,
Thank you for your great work. Do you plan on releasing the model's weights any time soon? The README doesn't seem to mention any pretrained model. This would be very helpful to quickly get representations for new sequences.

Asking about implementation of series connection of PLM & GNN in the FusionNetwork.

Hi, I've learned a lot from this great work. Thank you for presenting it in the paper and here!

I wanted to ask about the implementation of the series connection of PLM & GNN in the FusionNetwork. In the PLM+GNN paper (Zhang, Z. et al. Enhancing Protein Language Models with Structure-based Encoder and Pre-training. arXiv (2023) doi:10.48550/arxiv.2303.06275), the authors tested three ways of fusing PLM & GNN and decided to use the series connection. The series connection is described as

Series: we replace the node features of GearNet with the output of ESM-1b and use the output of GearNet as final representations.

In the implementation of FusionNetwork, I saw that it indeed uses the output of ESM-1b as the node features of GearNet, but it then seems to use the output of GearNet concatenated with the output of ESM-1b as the final representations (pasted below). Which way did the authors find most effective? Should one use the GearNet output alone or the concatenated output?

    def forward(self, graph, input, all_loss=None, metric=None):
        output1 = self.sequence_model(graph, input, all_loss, metric)
        node_output1 = output1.get("node_feature", output1.get("residue_feature"))
        output2 = self.structure_model(graph, node_output1, all_loss, metric)
        node_output2 = output2.get("node_feature", output2.get("residue_feature"))
        node_feature = torch.cat([node_output1, node_output2], dim=-1)
        graph_feature = torch.cat([
            output1['graph_feature'], 
            output2['graph_feature']
        ], dim=-1)
        return {
            "graph_feature": graph_feature,
            "node_feature": node_feature
        }

If possible, could you please share some configurations for trying out the "cross" style (quoted below) of fusing PLM & GNN? I am interested in testing this option and would like to know the transformer configurations (number of layers, hidden dims, number of heads) that you have tried.

Cross: we concatenate the output of ESM-1b and GearNet and then feed them into a transformer to perform cross-attention between modalities. The output of the transformer will be used as final representations.

About The Pre-training Process

Hey, I am sorry to trouble you about the pre-training details of GearNet. :)

After pre-training on AlphaFold, do you freeze the pre-trained model's parameters and only train the prediction head, or do you update the pre-trained model and its prediction head together?

Non-deterministic embeddings

Hi!

I was wondering if there is any reason that the GearNetIEConv encoder would return variable embeddings for the same input file. I encountered this using my own data, but when I set a torch manual_seed, the embeddings became constant for the same input. And is this expected to have any effect on model performance?

Thanks for your help!

pretrain dataset

Hi, I downloaded about 38k PDB files using the config files, but the paper indicates that the pretraining dataset has 805k structures. Is this expected? Many thanks!

modify the url

Hello,
I would like to ask how to modify the URL 'https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/UP000006548_3702_ARATH_v2.tar' when using the command 'python script/pretrain.py -c config/pretrain/mc_gearnet_edge.yaml --gpus [0]' to run this code.

a question about the downstream tasks

Hi authors, great work! I notice that GearNet is evaluated on four downstream tasks: 1) EC number prediction, 2) GO term prediction, 3) Fold classification, and 4) Reaction classification. I am interested in the GO term prediction task. Could the authors release the corresponding dataset for this task? Thanks!

Custom dataset. Data preprocessing

Thank you so much for your outstanding work!

I'm interested in your models and would like to run them on some custom datasets. Unfortunately, I haven't found any instructions on how to preprocess the raw data. Could you please tell me whether it is possible to run your models on custom datasets? And if so, where can I find your preprocessing script?

Thank you!

What information do the hidden dimensions represent?

Hello! GearNet is really good work, but I have a question. The hidden dimensions in the config file are set to [512,512,512,512,512,512]. Since I don't know much about how graph neural networks work, I would like to know what information each of these dimensions represents. Thank you!
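
A common reading of such a list, assumed here and not confirmed by the source, is that each entry is the output width of one graph convolution layer. Six entries would then mean six stacked layers, each producing a 512-dimensional vector per residue; the individual dimensions are learned and carry no predefined meaning. The sketch below shows how a list like this is typically turned into per-layer shapes. `layer_shapes` is a hypothetical helper and `input_dim = 21` is an illustrative value.

```python
def layer_shapes(input_dim, hidden_dims):
    """Pair each layer's input width with its output width."""
    dims = [input_dim] + list(hidden_dims)
    return list(zip(dims[:-1], dims[1:]))


hidden_dims = [512, 512, 512, 512, 512, 512]  # value quoted from the config
input_dim = 21  # assumption for illustration: one feature per residue type

for i, (d_in, d_out) in enumerate(layer_shapes(input_dim, hidden_dims)):
    print(f"layer {i}: {d_in} -> {d_out}")
```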

asking about how to obtain the new graph based on contrast learning

Hello, I am not very experienced at reading code, so I have a few questions about the model.
(I am very interested in your work, so I apologize for asking so many questions.)
According to the mc_gearnet_edge.yaml file, MultiviewContrast is followed by a multi-layer perceptron (MLP). However, MultiviewContrast produces two outputs, output1 and output2, each consisting of graph features and node features, while the MLP takes only one input.

What is the input to the MLP?
What is the model passed to the MultiviewContrast module? Its constructor is:

    def __init__(self, model, crop_funcs, noise_funcs, num_mlp_layer=2, activation="relu", tau=0.07):
        super(MultiviewContrast, self).__init__()

Is it GeometryAwareRelationalGraphNeuralNetwork?
In addition, at which step do you obtain the new graph based on contrastive learning mentioned in your article? (The MultiviewContrast module has two outputs, and I don't know which one is better.)

Looking forward to your reply!
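
For readers with the same question, here is a generic sketch of a two-view contrastive objective. It assumes, without confirmation from the source, that the same encoder and the same MLP head process each view separately and that the two projected graph features are then compared, which would explain why the MLP takes one input at a time. This is a standard InfoNCE-style loss in plain Python, not the repository's code; `cosine` and `info_nce` are hypothetical names, and only `tau=0.07` is taken from the signature quoted above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view1, view2, tau=0.07):
    """Average InfoNCE loss where view1[i] and view2[i] form the positive pair."""
    total = 0.0
    for i, x in enumerate(view1):
        logits = [cosine(x, y) / tau for y in view2]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(view1)


# Toy projected graph embeddings for a batch of two proteins, two views each.
view1 = [[1.0, 0.0], [0.0, 1.0]]
view2 = [[0.9, 0.1], [0.1, 0.9]]

aligned = info_nce(view1, view2)         # positives paired correctly
shuffled = info_nce(view1, view2[::-1])  # positives deliberately mismatched
print(aligned < shuffled)
```

Under this reading, neither output is "better": both views are needed only to compute the loss during pre-training, and the encoder is what gets reused afterwards.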

Edge_list set to [0,0,0]

Hi, thanks for your work!

In the Fold3D dataset class, why is the edge_list field set to an empty edge list of [[0,0,0]] when the input hdf5 files are loaded (on line 85 in dataset.py)? I'm trying to load in my own protein graphs into the GearNetIEConv model to get an output embedding, and if I don't set edge_list=[[0,0,0]], I run into an IndexOutOfBounds error later on when the protein graph is getting passed through the "message" function of the GeometricRelationalGraphConv layer.

RuntimeError: addmm: Argument #3 (dense), for training ESM_GearNet on EC

Hi, thank you for sharing the project. I am trying to reproduce the results of the ESM_GearNet model, and I have a problem with fine-tuning it on the EC downstream task. Here is the picture.

I was able to pretrain the model on AlphaFoldDB but failed to fine-tune it or train it from scratch on EC.

Attribution information of the Fold3D dataset

Hi, may I ask if you have a more detailed description of the data structure of the protein .hdf5 file of the Fold3D dataset? I find it contains much information about the protein, but I am not sure what some of them mean.

How to download and preprocess the PDB experimentally-determined protein.

"Specifically, we extract 123,505 experimentally-determined protein structures from PDB"

Hi, appreciate your great work!
I am quite new to this area. Could you please tell me where the code to download and preprocess those experimentally-determined proteins is?

max_length=100 for TruncateProtein in pretrain config files

Hello! Amazing work here. I am curious about a detail in the setup of the different pretraining tasks specified in the config files (.yaml files). In the configs for the self-prediction tasks, a TruncateProtein transform with max_length=100 is applied to the AlphaFoldDB dataset, while in the config for the Multiview Contrast task there is none. Is a similar truncation specified implicitly somewhere else for the MC task? Is truncation with max_length=100 needed to reproduce the results for pretraining on the self-prediction tasks?
Thank you!
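
To make the question concrete, here is a sketch of what truncation to `max_length=100` would mean for a single protein, assuming a contiguous window of residues is kept. `truncate_sequence` and its parameters are hypothetical and written in plain Python; this is not the TorchDrug transform itself.

```python
import random


def truncate_sequence(residues, max_length, random_start=False, rng=None):
    """Keep at most max_length consecutive residues.

    random_start=False keeps the first max_length residues;
    random_start=True keeps a randomly placed window of that length.
    """
    if len(residues) <= max_length:
        return list(residues)
    if random_start:
        rng = rng or random.Random()
        start = rng.randint(0, len(residues) - max_length)
    else:
        start = 0
    return list(residues[start:start + max_length])


protein = list(range(250))  # stand-in for 250 residue indices
cropped = truncate_sequence(
    protein, max_length=100, random_start=True, rng=random.Random(0)
)
print(len(cropped), cropped[0], cropped[-1])
```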
