
OntoProtein's People

Contributors

alexzhuan, cheng-siyuan, zxlzr


OntoProtein's Issues

Are GO-GO relations the same as Protein-GO relations?

Hello, researchers.

I have a question about your pre-training datasets. When I checked the list of pre-training data files, I found only one file related to relations, relation2id.txt, and your code uses this same file for both GO-GO and Protein-GO relations. This confuses me. Are the relation vocabularies for Protein-GO and GO-GO really the same? If so, I would like to know why, or how you decided that the relations for Protein-GO and GO-GO should be the same. (See the sketch below for the pattern I mean.)
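For concreteness, this is the sharing pattern I am asking about (a minimal sketch: only the relation2id.txt file name comes from the repo, while the loader and the tab-separated layout are my assumptions):

def load_relation2id(path='relation2id.txt'):
    # Assumed layout: one "relation<TAB>id" pair per line.
    relation2id = {}
    with open(path) as f:
        for line in f:
            relation, idx = line.strip().split('\t')
            relation2id[relation] = int(idx)
    return relation2id

relation2id = load_relation2id()
# ... the same dict then appears to be used when reading both the
# Protein-GO and the GO-GO triple files.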

Thanks.
Best regards,
Xinghao

Data files

Hello, how can the protein_go_triplet and protein_seq_map.txt files used in the code be obtained? I could not find where they are generated in the code.

[Confirmation] Optimal Hyperparameters and Reproducibility

Hi there,

Thanks for providing the nice codebase. I'm trying to reproduce the results for downstream tasks, and I have the following questions.

  • Are the scripts under this folder only samples? For the optimal OntoProtein hyperparameters, should we follow Table 6 in the paper?
  • For ProtBert, are you using the same optimal hyperparameters for each downstream task?
  • Table 6 doesn't cover the optimal values for gradient_accumulation_steps and eval_step. Can you help clarify this?

Any help is appreciated.

Question about computing resource and batch size

Hi,

Thanks for sharing the code. I noticed in your run_pretrain.sh that the batch size for protein-GO and protein MLM is 8, while the batch size for GO-GO is 64. Meanwhile, the number of negative samples per positive sample is 128 (256 for GO-GO).

(1) Does this mean in each GO-GO pass, at most (64*2+64*256) samples of length at most 128 are fed into the GO encoder (in one batch)?

(2) How many V100s did you use for this pretraining?

Also, I noticed that you don't permute (corrupt) the protein side when sampling negatives for protein-GO relations.

(3) Is this due to computing resource limits (i.e., 8*128 is simply too many protein sequences)?

(4) Did you experiment with a smaller number of negative samples while allowing such protein corruption?

Thanks in advance!
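For reference, the arithmetic behind question (1), using the numbers quoted above:

# Back-of-the-envelope count for question (1)
go_go_batch = 64      # positive GO-GO triples per batch
neg_per_pos = 256     # negative samples per positive (GO-GO)
positives = go_go_batch * 2            # head and tail of each positive triple
negatives = go_go_batch * neg_per_pos  # corrupted entities
print(positives + negatives)           # 16512 GO-term sequences (each <= 128 tokens)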

Problem with the biopython version?

Hello. The biopython version given in the README is 1.37, but pip no longer provides a 1.37 release (only versions below 1.30, 1.30 itself, and then 1.40 and above), so I installed the latest version, 1.78. Now, when I run the create_uniprot_data function in gen_onto_protein_data.py, I get the error shown below:

[screenshot of the error]

Is this a biopython version problem? If so, which version should I choose?

run_pretrain.sh error

I set up the deepspeed environment and then ran run_pretrain.sh, but the following error occurred:

File "run_pretrain.py", line 135, in
main()
File "run_pretrain.py", line 131, in main
trainer.train()
File "OntoProtein/src/trainer.py", line 167, in train
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/transformers/deepspeed.py", line 405, in deepspeed_init
hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/transformers/deepspeed.py", line 267, in trainer_config_finalize
hidden_size = model.config.hidden_size
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'OntoProteinPreTrainedModel' object has no attribute 'config'

I then pointed the config attribute at protein_model_config and enabled the commented-out part of training_arg.py, after which the following error occurred:

File "run_pretrain.py", line 135, in
main()
File "run_pretrain.py", line 131, in main
trainer.train()
File "OntoProtein/src/trainer.py", line 167, in train
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/transformers/deepspeed.py", line 437, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/init.py", line 120, in initialize
engine = DeepSpeedEngine(args=args,
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 239, in init
self._configure_with_arguments(args, mpu)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 872, in _configure_with_arguments
self._config = DeepSpeedConfig(self.config, mpu)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 875, in init
self._configure_train_batch_size()
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 1051, in _configure_train_batch_size
self._batch_assertion()
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 987, in _batch_assertion
train_batch > 0
TypeError: '>' not supported between instances of 'str' and 'int'

What could be causing this?
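For what it's worth, the TypeError in the second traceback can be reproduced in isolation: DeepSpeed's _batch_assertion compares train_batch with an integer, which fails whenever the batch size reaches the config as a string. A minimal sketch (the value "64" below is hypothetical):

# Minimal reproduction of the second error: a batch size that arrives
# as a string (e.g. unconverted from JSON or shell substitution) cannot
# be compared with an int.
train_batch = "64"  # hypothetical string value
try:
    assert train_batch > 0
except TypeError as err:
    print(err)  # '>' not supported between instances of 'str' and 'int'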

Trainer problem for pre-training

Hello, researchers
Thanks for your research. I have some questions about the pre-training phase.

  1. Does the pre-training code only support step-based training? When I replace the parameter "--max_steps" with "--num_train_epochs", I get an exception, so I am not sure whether trainer.py supports training by epochs.
  2. If the answer to Q1 is no: I got the following per-step results, so is the loss at each step meaningful? And another question: why does the "mlm" loss keep oscillating after a certain number of steps? Could you give me any advice about this situation?
    {'mlm': 1.3232421875, 'protein_go_ke': 0.66796875, 'go_go_ke': 1.9619140625, 'global_step': 180, 'learning_rate': [9.965326633165831e-06, 9.965326633165831e-06, 1.9930653266331662e-05]}
    {'mlm': 0.77783203125, 'protein_go_ke': 0.66650390625, 'go_go_ke': 1.857421875, 'global_step': 181, 'learning_rate': [9.964824120603016e-06, 9.964824120603016e-06, 1.9929648241206033e-05]}
    {'mlm': 0.7373046875, 'protein_go_ke': 0.64111328125, 'go_go_ke': 1.984375, 'global_step': 182, 'learning_rate': [9.964321608040202e-06, 9.964321608040202e-06, 1.9928643216080404e-05]}
    {'mlm': 0.447509765625, 'protein_go_ke': 2.140625, 'go_go_ke': 2.029296875, 'global_step': 183, 'learning_rate': [9.963819095477387e-06, 9.963819095477387e-06, 1.9927638190954775e-05]}
    {'mlm': 1.3056640625, 'protein_go_ke': 0.64990234375, 'go_go_ke': 1.91015625, 'global_step': 184, 'learning_rate': [9.963316582914575e-06, 9.963316582914575e-06, 1.992663316582915e-05]}
    {'mlm': 2.1015625, 'protein_go_ke': 0.6806640625, 'go_go_ke': 1.8505859375, 'global_step': 185, 'learning_rate': [9.96281407035176e-06, 9.96281407035176e-06, 1.992562814070352e-05]}
    {'mlm': 1.146484375, 'protein_go_ke': 0.6494140625, 'go_go_ke': 1.9150390625, 'global_step': 186, 'learning_rate': [9.962311557788946e-06, 9.962311557788946e-06, 1.992462311557789e-05]}
    {'mlm': 1.3505859375, 'protein_go_ke': 0.666015625, 'go_go_ke': 1.8994140625, 'global_step': 187, 'learning_rate': [9.961809045226131e-06, 9.961809045226131e-06, 1.9923618090452263e-05]}
    {'mlm': 1.359375, 'protein_go_ke': 2.775390625, 'go_go_ke': 1.8330078125, 'global_step': 188, 'learning_rate': [9.961306532663317e-06, 9.961306532663317e-06, 1.9922613065326634e-05]}
    {'mlm': 1.0927734375, 'protein_go_ke': 0.65087890625, 'go_go_ke': 1.8271484375, 'global_step': 189, 'learning_rate': [9.960804020100502e-06, 9.960804020100502e-06, 1.9921608040201005e-05]}
    [2022-10-13 09:32:21,562] [INFO] [logging.py:68:log_dist] [Rank 0] step=190, skipped=11, lr=[9.96030150753769e-06, 9.96030150753769e-06, 1.992060301507538e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
    [2022-10-13 09:32:21,992] [INFO] [timer.py:157:stop] 0/190, SamplesPerSec=4.138257351966589
    {'mlm': 1.7353515625, 'protein_go_ke': 0.6669921875, 'go_go_ke': 1.7998046875, 'global_step': 190, 'learning_rate': [9.96030150753769e-06, 9.96030150753769e-06, 1.992060301507538e-05]}
    {'mlm': 1.2763671875, 'protein_go_ke': 0.6787109375, 'go_go_ke': 1.8154296875, 'global_step': 191, 'learning_rate': [9.959798994974875e-06, 9.959798994974875e-06, 1.991959798994975e-05]}
    {'mlm': 0.80712890625, 'protein_go_ke': 0.6708984375, 'go_go_ke': 1.876953125, 'global_step': 192, 'learning_rate': [9.95929648241206e-06, 9.95929648241206e-06, 1.991859296482412e-05]}
    {'mlm': 0.59716796875, 'protein_go_ke': 0.6787109375, 'go_go_ke': 1.7919921875, 'global_step': 193, 'learning_rate': [9.958793969849248e-06, 9.958793969849248e-06, 1.9917587939698496e-05]}
    {'mlm': 0.7734375, 'protein_go_ke': 0.6611328125, 'go_go_ke': 1.90625, 'global_step': 194, 'learning_rate': [9.958291457286433e-06, 9.958291457286433e-06, 1.9916582914572867e-05]}
    {'mlm': 0.77587890625, 'protein_go_ke': 0.6865234375, 'go_go_ke': 1.76171875, 'global_step': 195, 'learning_rate': [9.957788944723619e-06, 9.957788944723619e-06, 1.9915577889447238e-05]}
    {'mlm': 0.89404296875, 'protein_go_ke': 0.6533203125, 'go_go_ke': 1.91015625, 'global_step': 196, 'learning_rate': [9.957286432160806e-06, 9.957286432160806e-06, 1.9914572864321612e-05]}
    {'mlm': 1.1416015625, 'protein_go_ke': 0.654296875, 'go_go_ke': 1.78125, 'global_step': 197, 'learning_rate': [9.956783919597992e-06, 9.956783919597992e-06, 1.9913567839195983e-05]}
    {'mlm': 1.0224609375, 'protein_go_ke': 0.66162109375, 'go_go_ke': 1.7841796875, 'global_step': 198, 'learning_rate': [9.956281407035177e-06, 9.956281407035177e-06, 1.9912562814070354e-05]}
    {'mlm': 0.56005859375, 'protein_go_ke': 0.65966796875, 'go_go_ke': 1.806640625, 'global_step': 199, 'learning_rate': [9.955778894472363e-06, 9.955778894472363e-06, 1.9911557788944725e-05]}
    [2022-10-13 09:33:02,238] [INFO] [logging.py:68:log_dist] [Rank 0] step=200, skipped=11, lr=[9.955276381909548e-06, 9.955276381909548e-06, 1.9910552763819096e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
    [2022-10-13 09:33:02,671] [INFO] [timer.py:157:stop] 0/200, SamplesPerSec=4.141814535981229

Best regards,
Xinghao

run_contact.sh error

Hi,
I set up a fresh environment for running the scripts, and when I run run_contact.sh I get the following error in "contact-ontoprotein.out":

***** Running Prediction *****
Num examples = 40
Batch size = 1
Traceback (most recent call last):
  File "run_downstream.py", line 286, in <module>
    main()
  File "run_downstream.py", line 281, in main
    predictions_family, input_ids_family, metrics_family = trainer.predict(test_dataset)
  File "/home/sakher/miniconda3/envs/onto2/lib/python3.8/site-packages/transformers/trainer.py", line 2358, in predict
    output = eval_loop(
  File "/data3/sakher/onto2/OntoProtein/src/benchmark/trainer.py", line 217, in evaluation_loop
    loss, logits, labels, prediction_score = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/data3/sakher/onto2/OntoProtein/src/benchmark/trainer.py", line 50, in prediction_step
    prediction_score['precision_at_l2'] = logits[3]['precision_at_l2']
KeyError: 'precision_at_l2'

Question: What is the evaluation metric for protein function classification?

Hello,
I want to know what evaluation metric is used for protein function classification.
I scrutinized the paper and code but could not find how it is evaluated.
In the paper I found the description "4% improvement with transductive setting ...", but I still cannot understand what the percentage refers to.
I'd appreciate your reply.

goatools package problem

Hello, when running the create_go_data part of gen_onto_protein_data.py, did you encounter problems like the following?
With goatools version 1.2.3, go_term.definition raises an error: there is no .definition attribute.
With goatools version 1.0.11, it raises RecursionError: maximum recursion depth exceeded while calling a Python object.

Exploding KE loss when training a new `protein-go` set

Hi! I was trying to re-train your model with a new set of protein-GO relations, and the KE losses for positive and negative samples explode unexpectedly: after a few steps the positive one increases to several tens of thousands and the negative one decreases to minus several tens of thousands, and they appear to keep diverging. Have you encountered this issue in your experiments? I'm thinking of confining the score to [0, 1] before feeding it into the KE loss. Thanks in advance!
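A minimal sketch of the bounding idea mentioned above (squashing the raw KE score into (0, 1) with a sigmoid before it enters the loss; this illustrates the proposal, not the repo's current behavior):

import torch

def bounded_score(raw_score: torch.Tensor) -> torch.Tensor:
    # A sigmoid keeps the score in (0, 1), so the loss can no longer be
    # driven to +/- tens of thousands by runaway score magnitudes.
    return torch.sigmoid(raw_score)

print(bounded_score(torch.tensor([-30000.0, 0.0, 30000.0])))
# tensor([0.0000, 0.5000, 1.0000])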

On GO prediction performance

Thanks again for sharing the code. The GO-prediction numbers in the paper seem to show a different trend (in how BP, MF, and CC change) from a recent related paper; see Table 3 of that work. Is this mainly because a different GO cutoff level is used? Also, the reported CCO value seems to have one decimal place too many (if I am reading the numbers correctly).

ImportError in deepspeed.py

Hi,

Thanks for open-sourcing this really cool model.

I'm trying to play around with pretraining it myself, but I run into this ImportError when I run the run_pretrain.sh script. I would greatly appreciate any guidance, thanks!

(onto_env) tomcobley@compute-g-17-147:~/OntoProtein $ bash script/run_pretrain.sh 

[2022-11-26 21:06:55,053] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with loc
al resources only.
[2022-11-26 21:06:56,224] [INFO] [runner.py:508:main] cmd = /home/tomcobley/.conda/envs/onto_env/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 run_pretrain.py --do_train --output_dir data/output_data/filtered_ke_text --pretrain_data_dir path/todata/ProteinKG25 --protein_seq_data_file_name swiss_seq --in_memory true --max_protein_seq_length 1024 --model_protein_seq_data true --model_protein_go_data true --model_go_go_data true --use_desc true --max_text_seq_length 128 --dataloader_protein_go_num_workers 1 --dataloader_go_go_num_workers 1 --dataloader_protein_seq_num_workers 1 --num_protein_go_neg_sample 128 --num_go_go_neg_sample 128 --negative_sampling_fn simple_random --protein_go_sample_head false --protein_go_sample_tail true --go_go_sample_head true --go_go_sample_tail true --protein_model_file_name data/model_data/ProtBERT --text_model_file_name data/model_data/PubMedBERT --go_encoder_cls bert --protein_encoder_cls bert --ke_embedding_size 512 --double_entity_embedding_size false --max_steps 60000 --per_device_train_batch_size 4 --weight_decay 0.01 --optimize_memory true --gradient_accumulation_steps 256 --lr_scheduler_type linear --mlm_lambda 1.0 --lm_learning_rate 1e-5 --lm_warmup_steps 50000 --ke_warmup_steps 50000 --ke_lambda 1.0 --ke_learning_rate 2e-5 --ke_max_score 12.0 --ke_score_fn transE --ke_warmup_ratio --seed 2021 --deepspeed dp_config.json --fp16 --dataloader_pin_memory


Traceback (most recent call last):
  File "/home/tomcobley/.conda/envs/onto_env/lib/python3.8/runpy.py", line 185, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home/tomcobley/.conda/envs/onto_env/lib/python3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/home/tomcobley/OntoProtein/deepspeed.py", line 25, in <module>
    from .dependency_versions_check import dep_version_check
ImportError: attempted relative import with no known parent package
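One observation from the traceback (a hedged reading, not a confirmed diagnosis): the failing file is OntoProtein/deepspeed.py inside the repo checkout rather than the installed deepspeed package, which suggests the local file shadows the package when the launcher runs from the repo root. A quick way to check which file wins the import:

# Run from the OntoProtein repo root (where run_pretrain.sh is launched).
import importlib.util

spec = importlib.util.find_spec("deepspeed")
print(spec.origin)  # a path ending in OntoProtein/deepspeed.py would confirm the shadowing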

What is the `OntoModel` referenced in `run_pretrain.sh`?

Hi!

I am trying to pretrain the model using the datasets and pretrained models mentioned in the README and the run_pretrain.sh script.

However, I have run into a few problems which seem to be due to my choice of TEXT_MODEL_PATH in run_pretrain.sh on this line.

What should this model be?

I assumed this meant the PubMedBERT model mentioned in the README, but execution fails if I use this - am I missing something?

Many thanks in advance!

Important relation for the protein sequence

Hi,

I see the following relations in the knowledge graph:
['enables_nucleotide_binding', 'enables_metal_ion_binding', 'enables_transferase_activity', 'enables', 'involved_in_signal_transduction', 'involved_in_regulation_of_transcription,_DNA-templated', 'involved_in_phosphorylation', 'involved_in', 'part_of_nucleus', 'part_of_cytoplasm', 'part_of', 'part_of_cytosol', 'part_of_membrane', 'colocalizes_with', 'involved_in_proteolysis', 'NOT|involved_in', 'part_of_integral_component_of_membrane', 'involved_in_cation_transport', 'involved_in_cellular_response_to_DNA_damage_stimulus', 'part_of_mitochondrion', 'involved_in_metabolic_process', 'involved_in_cell_cycle', 'involved_in_cell_division', 'involved_in_lipid_metabolic_process', 'enables_RNA_binding', 'acts_upstream_of_or_within', 'enables_catalytic_activity', 'enables_hydrolase_activity', 'enables_DNA_binding', 'contributes_to', 'involved_in_carbohydrate_metabolic_process', 'involved_in_translation', 'part_of_extracellular_region', 'acts_upstream_of_or_within_positive_effect', 'involved_in_protein_transport', 'NOT|enables', 'acts_upstream_of', 'part_of_ribosome', 'involved_in_transmembrane_transport', 'NOT|part_of', 'NOT|involved_in_tRNA_processing', 'is_active_in', 'located_in', 'NOT|located_in', 'acts_upstream_of_positive_effect']

Which of these relations is the most important for the protein sequence?

Graph knowledge creation

Hello,
I have been looking at this repo for the last couple of days.
I'm interested in how the knowledge graph is generated from the GO features.
How did you manage to create the graph?
Is there a code snippet or detailed documentation?
Thank you!

What is the difference between goa_uniprot_all.gaf and goa_uniprot_all.gat?

Hello, in your README you say you use goa_uniprot_all.gat, but create_goa_triplet in gen_onto_protein_data.py uses goa_uniprot_all.gaf to build the triplets, so I would like to confirm which file should be used.

Backward Error during Pre-training

Dear authors:
Thank you for your inspiring work!
When we try to follow your work and run the following script for pre-training

python -m torch.distributed.launch --nproc_per_node=4 run_pretrain.py \
  --output_dir 'output' \
  --do_train \
  --in_memory $IN_MEMORY \
  --max_protein_seq_length $MAX_PROTEIN_SEQ_LENGTH \
  --model_protein_seq_data true \
  --model_protein_go_data true \
  --model_go_go_data true \
  --use_desc $USE_DESC \
  --max_text_seq_length $MAX_TEXT_SEQ_LENGTH \
  --dataloader_protein_go_num_workers $PROTEIN_GO_NUM_WORKERS \
  --dataloader_go_go_num_workers $GO_GO_NUM_WORKERS \
  --dataloader_protein_seq_num_workers $PROTEIN_SEQ_NUM_WORKERS \
  --num_protein_go_neg_sample $NUM_PROTEIN_GO_NEG_SAMPLE \
  --num_go_go_neg_sample $NUM_GO_GO_NEG_SAMPLE \
  --negative_sampling_fn $NEGTIVE_SAMPLING_FN \
  --protein_go_sample_head $PROTEIN_GO_SAMPLE_HEAD \
  --protein_go_sample_tail $PROTEIN_GO_SAMPLE_TAIL \
  --go_go_sample_head $GO_GO_SAMPLE_HEAD \
  --go_go_sample_tail $GO_GO_SAMPLE_TAIL \
  --protein_model_file_name $PROTEIN_MODEL_PATH \
  --text_model_file_name $TEXT_MODEL_PATH \
  --go_encoder_cls $GO_ENCODER_CLS \
  --protein_encoder_cls $PROTEIN_ENCODER_CLS \
  --ke_embedding_size $KE_EMBEDDING_SIZE \
  --double_entity_embedding_size $DOUBLE_ENTITY_EMBEDDING_SIZE \
  --max_steps $MAX_STEPS \
  --per_device_train_batch_size $BATCH_SIZE \
  --weight_decay $WEIGHT_DECAY \
  --optimize_memory $OPTIMIZE_MEMORY \
  --gradient_accumulation_steps $ACCUMULATION_STEPS \
  --lr_scheduler_type $SCHEDULER_TYPE \
  --mlm_lambda $MLM_LAMBDA \
  --lm_learning_rate $MLM_LEARNING_RATE \
  --lm_warmup_ratio $LM_WARMUP_RATIO \
  --ke_warmup_ratio $KE_WARMUP_RATIO \
  --ke_lambda $KE_LAMBDA \
  --ke_learning_rate $KE_LEARNING_RATE \
  --ke_max_score $KE_MAX_SCORE \
  --ke_score_fn $KE_SCORE_FN \
  --ke_warmup_ratio $KE_WARMUP_RATIO \
  --seed 2021 \
  --fp16 \
  --dataloader_pin_memory

we get the following error:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 486 with name protein_lm.bert.encoder.layer.29.output.LayerNorm.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.

It seems that this error only appears with DDP training. We have also tried DeepSpeed, but an error occurs while loading the optimizer at line 168 in trainer.py.

I wonder how I can solve this. Thank you very much!
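For reference, the workaround named in the error message itself is applied where the DDP wrapper is created. A minimal, self-contained sketch (single process on CPU with gloo; the Linear module is a stand-in for OntoProteinPreTrainedModel, and whether a static graph is appropriate depends on the model):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal single-process setup so the sketch runs standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 8)  # stand-in for the real model
ddp_model = DDP(model)
# The workaround suggested by the error text: declare the module graph
# static so each parameter is marked ready only once per iteration.
ddp_model._set_static_graph()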

Missing pre-training data

Hello, I am trying to run the pre-training part of the code, but I found that the ProteinKG25 release you provide does not contain the swiss_seq folder or the .mdb data files inside it, and this data is used later during pre-training. Can this part of the data be generated from ProteinKG25, or only from the raw source data?

Rationale for choosing this loss function

Regarding your KE loss function, could you kindly provide some intuition on why this specific loss function was chosen, given that there are so many metric-learning losses for knowledge graphs? A few relevant pieces of literature that you referenced would be appreciated.
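For context, a minimal sketch of the kind of negative-sampling objective that the pre-training scripts' parameters point at (ke_score_fn transE, with ke_max_score 12.0 playing the role of the margin gamma); this is a generic formulation for discussion, not necessarily the exact loss in the repo:

import torch
import torch.nn.functional as F

def transe_score(h, r, t, gamma=12.0):
    # gamma - ||h + r - t||_2: higher means a more plausible triple.
    return gamma - torch.norm(h + r - t, p=2, dim=-1)

def negative_sampling_loss(pos_score, neg_score):
    # Push positive triples toward high scores and corrupted
    # (negative) triples toward low scores.
    return -(F.logsigmoid(pos_score).mean() + F.logsigmoid(-neg_score).mean()) / 2

h, r, t = (torch.randn(4, 512) for _ in range(3))  # embeddings of 4 positive triples
h_neg = torch.randn(4, 512)                        # corrupted heads
loss = negative_sampling_loss(transe_score(h, r, t), transe_score(h_neg, r, t))
print(loss.item())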

Issue in creating an environment for OntoProtein pretraining

Hello researchers,
I am running into bugs when installing deepspeed version 0.5.1. I have already installed python 3.8.13, pytorch 1.12.0 with torchvision 0.13.0, torchaudio 0.12.0, cudatoolkit 11.3.1, transformers 4.9.2, and lmdb 1.3.0, but when I install deepspeed 0.5.1 its dependencies are not installed correctly. Can you please tell me the exact versions of pytorch, python, and deepspeed that you used?
Below is the error I get:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/deepspeed/__init__.py", line 15, in <module>
    from .runtime.engine import DeepSpeedEngine, DeepSpeedOptimizerCallable, DeepSpeedSchedulerCallable
  File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 20, in <module>
    from tensorboardX import SummaryWriter
  File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/__init__.py", line 5, in <module>
    from .torchvis import TorchVis
  File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/torchvis.py", line 11, in <module>
    from .writer import SummaryWriter
  File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/writer.py", line 15, in <module>
    from .event_file_writer import EventFileWriter
  File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 28, in <module>
    from .proto import event_pb2
  File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/event_pb2.py", line 15, in <module>
    from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
  File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/summary_pb2.py", line 15, in <module>
    from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
  File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/tensor_pb2.py", line 15, in <module>
    from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
  File "//mnt/user1/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/resource_handle_pb2.py", line 35, in <module>
    _descriptor.FieldDescriptor(
  File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 560, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
If you cannot immediately regenerate your protos, some other possible workarounds are:

  1. Downgrade the protobuf package to 3.20.x or lower.
  2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
Can you tell me the exact versions you used?

Computational Resources and Time

Can you provide a recommendation for the computational resources needed to run one of the downstream tasks, such as run_contact including fine-tuning the model (i.e., do_train = True)? For example, the suggested number of cores and amount of memory, and how long it is expected to take. Also, which resources were used in your experiments, and how long did they take?

I am trying to run the protein contact prediction task on 16 cores and 120 GB of memory, with an estimated week required to get the results; however, the process keeps getting killed because of insufficient memory.

Running Code in Google Colab

Hi,

I am really interested in your work on OntoProtein. Does your code run on Google Colab? It would be helpful if you had a Google Colab tutorial for running your code.

OntoProtein pretrained model

Hello, I want to use OntoProtein to compute protein embeddings. I downloaded the model from https://huggingface.co/zjukg/OntoProtein/tree/main and saved it locally, but the embeddings computed for different proteins are identical. Is this normal?

  • Files downloaded locally
    config.json, pytorch_model.bin, tokenizer_config.json, and vocab.txt
  • Script used to compute the embeddings
import logging
from collections import Counter  # used below to compare embeddings element-wise

import numpy as np
import torch
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModel,
)

logger = logging.getLogger(__name__)
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
model_name_or_path = '/data/wenyuhao/55/model/ontology'
config = AutoConfig.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
model = AutoModel.from_pretrained(model_name_or_path, config=config).to(device)

def getArray(seq):
    input_ids = torch.tensor(tokenizer.encode(seq)).unsqueeze(0).to(device)  # batch size 1
    with torch.no_grad():
        outputs = model(input_ids)
    return outputs[1].cpu().numpy()  # outputs[1] is the pooled output
  • Result
In [14]: a = getArray('VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS')
In [15]: b = getArray('YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF')
In [16]: a
Out[16]: 
array([[-0.11852779,  0.1262154 , -0.11203501, ...,  0.11941278,
         0.11056887, -0.12232994]], dtype=float32)
In [17]: b
Out[17]: 
array([[-0.11852779,  0.1262154 , -0.11203501, ...,  0.11941278,
         0.11056887, -0.12232994]], dtype=float32)
In [18]: Counter(a[0]==b[0])
Out[18]: Counter({True: 1024})

I computed embeddings for all Swiss-Prot proteins and found that they are all identical:

In [29]: s
Out[29]: 
array([[-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
         0.11056883, -0.12232988],
       [-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
         0.11056883, -0.12232988],
       [-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
         0.11056883, -0.12232988],
       ...,
       [-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
         0.11056883, -0.12232988],
       [-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
         0.11056883, -0.12232988],
       [-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
         0.11056883, -0.12232988]], dtype=float32)

In [30]: s.shape
Out[30]: (20083, 1024)

In [31]: (s==s).all()
Out[31]: True
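One thing worth checking (an assumption on my side, since ProtBert-style vocabularies are typically trained on space-separated residues): encoding a raw string such as 'VFYL...' may collapse to unknown tokens, which would explain every sequence pooling to the same vector. A hedged sketch of an alternative extraction, with spaced input and mean pooling instead of outputs[1]:

import torch
from transformers import AutoTokenizer, AutoModel

model_name_or_path = '/data/wenyuhao/55/model/ontology'  # local path from this issue
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
model = AutoModel.from_pretrained(model_name_or_path).eval()

def get_embedding(seq):
    spaced = ' '.join(seq)  # 'VFYL...' -> 'V F Y L ...'
    inputs = tokenizer(spaced, return_tensors='pt')
    with torch.no_grad():
        out = model(**inputs)
    # Mean-pool the last hidden state instead of using the pooler output.
    return out.last_hidden_state.mean(dim=1).squeeze(0).numpy()

print(get_embedding('VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS')[:3])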

No module named 'transformers.deepspeed'

When I tried to run the sample command sh run_main.sh ......, I got the following error:

Traceback (most recent call last):
  File "run_downstream.py", line 8, in <module>
    from src.models import model_mapping, load_adam_optimizer_and_scheduler
  File "/mnt/SSD2/pmtnet_proj/code/github/OntoProtein/src/models.py", line 16, in <module>
    from transformers.deepspeed import is_deepspeed_zero3_enabled
ModuleNotFoundError: No module named 'transformers.deepspeed'

Is there a benchmark comparison with ESM-1b?

Thanks for sharing the code for this very interesting paper! Do you have comparison results against ESM-1b on these tasks? Also, are the metrics reported on TAPE all fine-tuned (supervised) results rather than unsupervised results?

Question: Is it possible to get the embeddings of the relations and entities separately using OntoProtein?

Hi,

I am just curious: is it possible to get the embeddings of the proteins and relations in https://www.zjukg.org/project/ProteinKG25/ separately using the Hugging Face OntoProtein model (https://huggingface.co/zjunlp/OntoProtein)? Let us say

Protein_1 Enables Protein_2

is a triple in the knowledge graph. Is it possible to get the embeddings of Protein_1, Enables, and Protein_2 separately using https://huggingface.co/zjunlp/OntoProtein?
