zjunlp / ontoprotein
[ICLR 2022] OntoProtein: Protein Pretraining With Gene Ontology Embedding
License: MIT License
Hello, researchers.
I have a question about your pre-training datasets. When I checked the list of pre-training data files, I found only one relation-related file, relation2id.txt, and your code uses this same file for both GO-GO relations and Protein-GO relations. This confuses me. Are the relation files for Protein-GO and GO-GO really the same? If so, why should Protein-GO and GO-GO relations share the same set of relations?
Thanks.
Best regards,
Xinghao
Hello, how can I obtain the protein_go_triplet and protein_seq_map.txt files from the dataset? I could not find them in the code.
Hi there,
Thanks for providing the nice codebase. I'm trying to reproduce the results for downstream tasks, and I have questions about gradient_accumulation_steps and eval_step. Can you help clarify this? Any help is appreciated.
Hi,
Thanks for sharing the code. I noticed that in your run_pretrain.sh, the batch size for protein-GO and protein MLM is 8, while the batch size for GO-GO is 64. Meanwhile, the number of negative samples per positive sample is 128, or 256 for GO-GO.
(1) Does this mean in each GO-GO pass, at most (64*2+64*256) samples of length at most 128 are fed into the GO encoder (in one batch)?
(2) How many V100s did you use for this pretraining?
Also, I noticed that you don't permute proteins for protein-GO relations.
(3) Is this due to computing resource limit (i.e. 8*128 is just too large a number for proteins)?
(4) Did you experiment with a lower number of negative samples while considering such protein permutation?
Thanks in advance!
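The count in question (1) can be checked with a few lines of arithmetic. This is only a sketch of the reading proposed in the question (two entities per positive GO-GO triplet plus 256 negatives per positive); the helper name is hypothetical and the thread does not confirm that the repository batches samples this way.

```python
def go_go_entities_per_batch(batch_size: int, num_neg: int) -> int:
    """Count GO entities encoded in one GO-GO batch under the question's reading."""
    positives = batch_size * 2        # head and tail of every positive triplet
    negatives = batch_size * num_neg  # corrupted entities for every positive triplet
    return positives + negatives


# Values quoted in the question: batch size 64, 256 negatives per positive.
print(go_go_entities_per_batch(64, 256))  # 16512
```

Each of those entities would be a GO description of at most 128 tokens if max_text_seq_length is 128, as in the question.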
I set up a deepspeed environment and ran run_pretrain.sh, but got the following error:
File "run_pretrain.py", line 135, in <module>
main()
File "run_pretrain.py", line 131, in main
trainer.train()
File "OntoProtein/src/trainer.py", line 167, in train
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/transformers/deepspeed.py", line 405, in deepspeed_init
hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/transformers/deepspeed.py", line 267, in trainer_config_finalize
hidden_size = model.config.hidden_size
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'OntoProteinPreTrainedModel' object has no attribute 'config'
I then pointed the config attribute to protein_model_config and ran the commented-out part of training_arg.py, which produced the following error:
File "run_pretrain.py", line 135, in <module>
main()
File "run_pretrain.py", line 131, in main
trainer.train()
File "OntoProtein/src/trainer.py", line 167, in train
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/transformers/deepspeed.py", line 437, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/__init__.py", line 120, in initialize
engine = DeepSpeedEngine(args=args,
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 239, in __init__
self._configure_with_arguments(args, mpu)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 872, in _configure_with_arguments
self._config = DeepSpeedConfig(self.config, mpu)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 875, in __init__
self._configure_train_batch_size()
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 1051, in _configure_train_batch_size
self._batch_assertion()
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 987, in _batch_assertion
train_batch > 0
TypeError: '>' not supported between instances of 'str' and 'int'
What is causing this?
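One possible explanation, not confirmed in this thread: the Hugging Face DeepSpeed integration lets a JSON config use "auto" placeholders that the Trainer is expected to replace with real numbers. If a placeholder such as train_batch_size reaches DeepSpeed unreplaced, a check like train_batch > 0 would compare a str with an int. The sketch below lists any values still set to "auto" in a config dictionary; the function name and the sample config are hypothetical.

```python
def find_unresolved_auto(config: dict, prefix: str = "") -> list:
    """Return dotted paths of config entries whose value is still the string "auto"."""
    hits = []
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested sections such as "optimizer" or "fp16".
            hits.extend(find_unresolved_auto(value, path + "."))
        elif value == "auto":
            hits.append(path)
    return hits


# Hypothetical config fragment, for illustration only.
sample_config = {
    "train_batch_size": "auto",
    "fp16": {"enabled": True},
    "optimizer": {"params": {"lr": "auto"}},
}
print(find_unresolved_auto(sample_config))
```

In practice the dictionary would come from json.load on the file passed via --deepspeed; replacing the reported entries with explicit numbers is one way to test whether this is the cause.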
Hello, researchers
Thanks for your research. I have some problems with the pre-training phases.
{'mlm': 1.3232421875, 'protein_go_ke': 0.66796875, 'go_go_ke': 1.9619140625, 'global_step': 180, 'learning_rate': [9.965326633165831e-06, 9.965326633165831e-06, 1.9930653266331662e-05]}
{'mlm': 0.77783203125, 'protein_go_ke': 0.66650390625, 'go_go_ke': 1.857421875, 'global_step': 181, 'learning_rate': [9.964824120603016e-06, 9.964824120603016e-06, 1.9929648241206033e-05]}
{'mlm': 0.7373046875, 'protein_go_ke': 0.64111328125, 'go_go_ke': 1.984375, 'global_step': 182, 'learning_rate': [9.964321608040202e-06, 9.964321608040202e-06, 1.9928643216080404e-05]}
{'mlm': 0.447509765625, 'protein_go_ke': 2.140625, 'go_go_ke': 2.029296875, 'global_step': 183, 'learning_rate': [9.963819095477387e-06, 9.963819095477387e-06, 1.9927638190954775e-05]}
{'mlm': 1.3056640625, 'protein_go_ke': 0.64990234375, 'go_go_ke': 1.91015625, 'global_step': 184, 'learning_rate': [9.963316582914575e-06, 9.963316582914575e-06, 1.992663316582915e-05]}
{'mlm': 2.1015625, 'protein_go_ke': 0.6806640625, 'go_go_ke': 1.8505859375, 'global_step': 185, 'learning_rate': [9.96281407035176e-06, 9.96281407035176e-06, 1.992562814070352e-05]}
{'mlm': 1.146484375, 'protein_go_ke': 0.6494140625, 'go_go_ke': 1.9150390625, 'global_step': 186, 'learning_rate': [9.962311557788946e-06, 9.962311557788946e-06, 1.992462311557789e-05]}
{'mlm': 1.3505859375, 'protein_go_ke': 0.666015625, 'go_go_ke': 1.8994140625, 'global_step': 187, 'learning_rate': [9.961809045226131e-06, 9.961809045226131e-06, 1.9923618090452263e-05]}
{'mlm': 1.359375, 'protein_go_ke': 2.775390625, 'go_go_ke': 1.8330078125, 'global_step': 188, 'learning_rate': [9.961306532663317e-06, 9.961306532663317e-06, 1.9922613065326634e-05]}
{'mlm': 1.0927734375, 'protein_go_ke': 0.65087890625, 'go_go_ke': 1.8271484375, 'global_step': 189, 'learning_rate': [9.960804020100502e-06, 9.960804020100502e-06, 1.9921608040201005e-05]}
[2022-10-13 09:32:21,562] [INFO] [logging.py:68:log_dist] [Rank 0] step=190, skipped=11, lr=[9.96030150753769e-06, 9.96030150753769e-06, 1.992060301507538e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
[2022-10-13 09:32:21,992] [INFO] [timer.py:157:stop] 0/190, SamplesPerSec=4.138257351966589
{'mlm': 1.7353515625, 'protein_go_ke': 0.6669921875, 'go_go_ke': 1.7998046875, 'global_step': 190, 'learning_rate': [9.96030150753769e-06, 9.96030150753769e-06, 1.992060301507538e-05]}
{'mlm': 1.2763671875, 'protein_go_ke': 0.6787109375, 'go_go_ke': 1.8154296875, 'global_step': 191, 'learning_rate': [9.959798994974875e-06, 9.959798994974875e-06, 1.991959798994975e-05]}
{'mlm': 0.80712890625, 'protein_go_ke': 0.6708984375, 'go_go_ke': 1.876953125, 'global_step': 192, 'learning_rate': [9.95929648241206e-06, 9.95929648241206e-06, 1.991859296482412e-05]}
{'mlm': 0.59716796875, 'protein_go_ke': 0.6787109375, 'go_go_ke': 1.7919921875, 'global_step': 193, 'learning_rate': [9.958793969849248e-06, 9.958793969849248e-06, 1.9917587939698496e-05]}
{'mlm': 0.7734375, 'protein_go_ke': 0.6611328125, 'go_go_ke': 1.90625, 'global_step': 194, 'learning_rate': [9.958291457286433e-06, 9.958291457286433e-06, 1.9916582914572867e-05]}
{'mlm': 0.77587890625, 'protein_go_ke': 0.6865234375, 'go_go_ke': 1.76171875, 'global_step': 195, 'learning_rate': [9.957788944723619e-06, 9.957788944723619e-06, 1.9915577889447238e-05]}
{'mlm': 0.89404296875, 'protein_go_ke': 0.6533203125, 'go_go_ke': 1.91015625, 'global_step': 196, 'learning_rate': [9.957286432160806e-06, 9.957286432160806e-06, 1.9914572864321612e-05]}
{'mlm': 1.1416015625, 'protein_go_ke': 0.654296875, 'go_go_ke': 1.78125, 'global_step': 197, 'learning_rate': [9.956783919597992e-06, 9.956783919597992e-06, 1.9913567839195983e-05]}
{'mlm': 1.0224609375, 'protein_go_ke': 0.66162109375, 'go_go_ke': 1.7841796875, 'global_step': 198, 'learning_rate': [9.956281407035177e-06, 9.956281407035177e-06, 1.9912562814070354e-05]}
{'mlm': 0.56005859375, 'protein_go_ke': 0.65966796875, 'go_go_ke': 1.806640625, 'global_step': 199, 'learning_rate': [9.955778894472363e-06, 9.955778894472363e-06, 1.9911557788944725e-05]}
[2022-10-13 09:33:02,238] [INFO] [logging.py:68:log_dist] [Rank 0] step=200, skipped=11, lr=[9.955276381909548e-06, 9.955276381909548e-06, 1.9910552763819096e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
[2022-10-13 09:33:02,671] [INFO] [timer.py:157:stop] 0/200, SamplesPerSec=4.141814535981229
Best regards,
Xinghao
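A log like the one above can be inspected programmatically. The sketch below parses the per-step dictionaries and flags steps where one loss term jumps well above its median, as protein_go_ke does at steps 183 and 188. The function names and the factor-of-two threshold are hypothetical choices for illustration; the three sample records are copied from the log in this post.

```python
import ast
import re
from statistics import median

# Each per-step record is a Python dict literal starting with the 'mlm' key.
RECORD = re.compile(r"\{'mlm':.*?\}", re.DOTALL)


def parse_records(log_text: str) -> list:
    """Extract the per-step loss dictionaries from raw training log text."""
    return [ast.literal_eval(match.group(0)) for match in RECORD.finditer(log_text)]


def spike_steps(records: list, key: str, factor: float = 2.0) -> list:
    """Return global steps where `key` exceeds `factor` times its median value."""
    threshold = factor * median(r[key] for r in records)
    return [r["global_step"] for r in records if r[key] > threshold]


SAMPLE = """
{'mlm': 0.7373046875, 'protein_go_ke': 0.64111328125, 'go_go_ke': 1.984375, 'global_step': 182, 'learning_rate': [9.964321608040202e-06, 9.964321608040202e-06, 1.9928643216080404e-05]}
{'mlm': 0.447509765625, 'protein_go_ke': 2.140625, 'go_go_ke': 2.029296875, 'global_step': 183, 'learning_rate': [9.963819095477387e-06, 9.963819095477387e-06, 1.9927638190954775e-05]}
{'mlm': 1.3056640625, 'protein_go_ke': 0.64990234375, 'go_go_ke': 1.91015625, 'global_step': 184, 'learning_rate': [9.963316582914575e-06, 9.963316582914575e-06, 1.992663316582915e-05]}
"""

records = parse_records(SAMPLE)
print(spike_steps(records, "protein_go_ke"))  # [183]
```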
I'm trying to run the pre-training script on a server that has no GPU. Would that be possible?
Hi,
I set up a fresh environment for running the script, and when I run run_contact.sh I get the following error in "contact-ontoprotein.out":
***** Running Prediction *****
Num examples = 40
Batch size = 1
Traceback (most recent call last):
File "run_downstream.py", line 286, in <module>
main()
File "run_downstream.py", line 281, in main
predictions_family, input_ids_family, metrics_family = trainer.predict(test_dataset)
File "/home/sakher/miniconda3/envs/onto2/lib/python3.8/site-packages/transformers/trainer.py", line 2358, in predict
output = eval_loop(
File "/data3/sakher/onto2/OntoProtein/src/benchmark/trainer.py", line 217, in evaluation_loop
loss, logits, labels, prediction_score = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
File "/data3/sakher/onto2/OntoProtein/src/benchmark/trainer.py", line 50, in prediction_step
prediction_score['precision_at_l2'] = logits[3]['precision_at_l2']
KeyError: 'precision_at_l2'
Hi, is it possible to use the https://huggingface.co/zjunlp/OntoProtein model to get the embedding of a protein sequence?
Hello,
I want to know which evaluation metric is used for protein function classification.
I went through the paper and code carefully, but could not find how it was evaluated.
In the paper I found the description "4% improvement with transductive setting ..." but I still cannot tell what the percentage refers to.
I'd appreciate your reply.
Hello, when you ran the create_go_data part of gen_onto_protein_data.py, did you run into problems like the following?
With goatools version 1.2.3, go_term.definition raises an error: there is no .definition attribute.
With goatools version 1.0.11, I get RecursionError: maximum recursion depth exceeded while calling a Python object
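One defensive way to handle an attribute that differs between library versions is to try several candidate names. In the sketch below, the name "defn" is an assumption about newer goatools releases and should be checked against the installed version; the helper itself is hypothetical and works on any object.

```python
import sys
from types import SimpleNamespace


def get_definition(term, candidates=("definition", "defn")) -> str:
    """Return the first non-empty attribute among `candidates`, or an empty string."""
    for name in candidates:
        value = getattr(term, name, None)
        if value:
            return value
    return ""


# Stand-in objects for illustration; real terms would come from the GO DAG.
old_style = SimpleNamespace(definition="old-style text")
new_style = SimpleNamespace(defn="new-style text")
print(get_definition(old_style), "|", get_definition(new_style))

# For a RecursionError on deep ontologies, raising the interpreter limit is a
# common workaround; the value below is an arbitrary choice.
sys.setrecursionlimit(max(sys.getrecursionlimit(), 10000))
```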
Hi! I was trying to re-train your model with a new set of protein-GO relations, and it looks like the KE losses for positive and negative samples are exploding unexpectedly. After a few steps, the positive one increases to several tens of thousands and the negative one decreases to minus several tens of thousands. It seems they would continue to explode. Have you encountered this issue in your experiments? I'm thinking of confining the score to [0, 1] before inputting into the KE loss. Thanks in advance!
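For reference, here is a minimal sketch of a negative-sampling KE loss with a margin, written with a numerically stable log-sigmoid. It assumes the loss has the form -log σ(γ - d_pos) - mean(log σ(d_neg - γ)) and that the margin γ corresponds to a setting like ke_max_score; both are assumptions rather than a statement of the repository's code. It shows that the positive term grows roughly linearly once distances become very large, which fits values in the tens of thousands.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)) for any finite x."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def ke_loss_terms(pos_dist: float, neg_dists: list, gamma: float = 12.0) -> tuple:
    """Return (positive term, negative term) of the margin-based KE loss."""
    pos_term = -log_sigmoid(gamma - pos_dist)
    neg_term = -sum(log_sigmoid(d - gamma) for d in neg_dists) / len(neg_dists)
    return pos_term, neg_term


# Well-separated case: both terms are close to zero.
print(ke_loss_terms(0.0, [24.0]))
# Exploded positive distance: the positive term is about pos_dist - gamma.
print(ke_loss_terms(1e5, [24.0]))
```

Clamping or normalizing the distances before the loss, as suggested in the post, would bound these terms, at the cost of zero gradient wherever the clamp is active.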
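Thanks again to the authors for sharing the code. The GO prediction numbers in the paper seem to show a different trend from another recent paper (the variation across BP, MF, and CC), see Table 3 of that paper. Is this mainly due to a different GO cutoff level? Also, the reported CCO values appear to have one extra decimal place (if I am reading the numbers correctly).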
Hi,
Thanks for open-sourcing this really cool model.
I'm trying to play around with pretraining it myself, but I run into this ImportError when I run the run_pretrain.sh script. I would greatly appreciate any guidance, thanks!
(onto_env) tomcobley@compute-g-17-147:~/OntoProtein $ bash script/run_pretrain.sh
[2022-11-26 21:06:55,053] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-11-26 21:06:56,224] [INFO] [runner.py:508:main] cmd = /home/tomcobley/.conda/envs/onto_env/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 run_pretrain.py --do_train --output_dir data/output_data/filtered_ke_text --pretrain_data_dir path/todata/ProteinKG25 --protein_seq_data_file_name swiss_seq --in_memory true --max_protein_seq_length 1024 --model_protein_seq_data true --model_protein_go_data true --model_go_go_data true --use_desc true --max_text_seq_length 128 --dataloader_protein_go_num_workers 1 --dataloader_go_go_num_workers 1 --dataloader_protein_seq_num_workers 1 --num_protein_go_neg_sample 128 --num_go_go_neg_sample 128 --negative_sampling_fn simple_random --protein_go_sample_head false --protein_go_sample_tail true --go_go_sample_head true --go_go_sample_tail true --protein_model_file_name data/model_data/ProtBERT --text_model_file_name data/model_data/PubMedBERT --go_encoder_cls bert --protein_encoder_cls bert --ke_embedding_size 512 --double_entity_embedding_size false --max_steps 60000 --per_device_train_batch_size 4 --weight_decay 0.01 --optimize_memory true --gradient_accumulation_steps 256 --lr_scheduler_type linear --mlm_lambda 1.0 --lm_learning_rate 1e-5 --lm_warmup_steps 50000 --ke_warmup_steps 50000 --ke_lambda 1.0 --ke_learning_rate 2e-5 --ke_max_score 12.0 --ke_score_fn transE --ke_warmup_ratio --seed 2021 --deepspeed dp_config.json --fp16 --dataloader_pin_memory
Traceback (most recent call last):
File "/home/tomcobley/.conda/envs/onto_env/lib/python3.8/runpy.py", line 185, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/home/tomcobley/.conda/envs/onto_env/lib/python3.8/runpy.py", line 111, in _get_module_details
__import__(pkg_name)
File "/home/tomcobley/OntoProtein/deepspeed.py", line 25, in <module>
from .dependency_versions_check import dep_version_check
ImportError: attempted relative import with no known parent package
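The traceback shows Python loading /home/tomcobley/OntoProtein/deepspeed.py rather than an installed package, so a local file named deepspeed.py may be shadowing the real deepspeed module when the launcher is started from the repository root. The sketch below checks a directory for files that would shadow a given module name; the helper name is hypothetical.

```python
from pathlib import Path


def shadowing_candidates(module_name: str, directory: str) -> list:
    """List files in `directory` that would be imported instead of `module_name`."""
    base = Path(directory)
    candidates = (
        base / f"{module_name}.py",           # single-file module
        base / module_name / "__init__.py",   # package directory
    )
    return [path for path in candidates if path.is_file()]


# Example: check the current working directory for a local deepspeed module.
for hit in shadowing_candidates("deepspeed", "."):
    print(f"local file shadows the installed package: {hit}")
```

If such a file turns up, launching from another directory or moving the file to wherever the project instructions intend it to go are two things to try.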
Hi!
I am trying to pretrain the model using the datasets and pretrained models mentioned in the README and the run_pretrain.sh script.
However, I have run into a few problems which seem to be due to my choice of TEXT_MODEL_PATH in run_pretrain.sh on this line.
What should this model be?
I assumed this meant the PubMedBERT model mentioned in the README, but execution fails if I use this - am I missing something?
Many thanks in advance!
Hi Authors,
Thanks for the great work!
While checking your dataset, I found that the statistics of the downloaded dataset differ from those reported at this link.
Specifically, the downloaded valid and test sets are much larger than the reported sizes. I also found that the valid set contains data from the training set.
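An overlap like the one described can be measured with a set intersection over the split files. This is a sketch only: it assumes one sample per line, and the function name and toy data are hypothetical.

```python
def split_overlap(train_lines, valid_lines) -> tuple:
    """Return (number of valid samples also present in train, number of unique valid samples)."""
    train = {line.strip() for line in train_lines if line.strip()}
    valid = {line.strip() for line in valid_lines if line.strip()}
    return len(train & valid), len(valid)


# Toy data for illustration; real input would be lines read from the split files.
train_split = ["P1 r1 GO:1\n", "P2 r1 GO:2\n", "P3 r2 GO:3\n"]
valid_split = ["P2 r1 GO:2\n", "P4 r2 GO:4\n"]
shared, total = split_overlap(train_split, valid_split)
print(f"{shared} of {total} validation samples also appear in training")
```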
Hi,
I see the following relations in the knowledge graph:
['enables_nucleotide_binding', 'enables_metal_ion_binding', 'enables_transferase_activity', 'enables', 'involved_in_signal_transduction', 'involved_in_regulation_of_transcription,_DNA-templated', 'involved_in_phosphorylation', 'involved_in', 'part_of_nucleus', 'part_of_cytoplasm', 'part_of', 'part_of_cytosol', 'part_of_membrane', 'colocalizes_with', 'involved_in_proteolysis', 'NOT|involved_in', 'part_of_integral_component_of_membrane', 'involved_in_cation_transport', 'involved_in_cellular_response_to_DNA_damage_stimulus', 'part_of_mitochondrion', 'involved_in_metabolic_process', 'involved_in_cell_cycle', 'involved_in_cell_division', 'involved_in_lipid_metabolic_process', 'enables_RNA_binding', 'acts_upstream_of_or_within', 'enables_catalytic_activity', 'enables_hydrolase_activity', 'enables_DNA_binding', 'contributes_to', 'involved_in_carbohydrate_metabolic_process', 'involved_in_translation', 'part_of_extracellular_region', 'acts_upstream_of_or_within_positive_effect', 'involved_in_protein_transport', 'NOT|enables', 'acts_upstream_of', 'part_of_ribosome', 'involved_in_transmembrane_transport', 'NOT|part_of', 'NOT|involved_in_tRNA_processing', 'is_active_in', 'located_in', 'NOT|located_in', 'acts_upstream_of_positive_effect']
Which relation is the most important for the protein sequence?
Hello,
I have been looking at this repo for the last couple of days.
I'm interested in knowledge graph generation from the GO features.
How did you manage to create the graph?
Is there any snippet of code or detailed documentation?
Thank you!
Hello, in your README you use goa_uniprot_all.gat, but create_goa_triplet in gen_onto_protein_data.py uses goa_uniprot_all.gaf to build the triplets. Which file should I use?
Dear authors:
Thank you for your inspiring work !
When we try to follow your work and run the following scripts for pre-training
python -m torch.distributed.launch --nproc_per_node=4 run_pretrain.py \
--output_dir 'output' \
--do_train \
--in_memory $IN_MEMORY \
--max_protein_seq_length $MAX_PROTEIN_SEQ_LENGTH \
--model_protein_seq_data true \
--model_protein_go_data true \
--model_go_go_data true \
--use_desc $USE_DESC \
--max_text_seq_length $MAX_TEXT_SEQ_LENGTH \
--dataloader_protein_go_num_workers $PROTEIN_GO_NUM_WORKERS \
--dataloader_go_go_num_workers $GO_GO_NUM_WORKERS \
--dataloader_protein_seq_num_workers $PROTEIN_SEQ_NUM_WORKERS \
--num_protein_go_neg_sample $NUM_PROTEIN_GO_NEG_SAMPLE \
--num_go_go_neg_sample $NUM_GO_GO_NEG_SAMPLE \
--negative_sampling_fn $NEGTIVE_SAMPLING_FN \
--protein_go_sample_head $PROTEIN_GO_SAMPLE_HEAD \
--protein_go_sample_tail $PROTEIN_GO_SAMPLE_TAIL \
--go_go_sample_head $GO_GO_SAMPLE_HEAD \
--go_go_sample_tail $GO_GO_SAMPLE_TAIL \
--protein_model_file_name $PROTEIN_MODEL_PATH \
--text_model_file_name $TEXT_MODEL_PATH \
--go_encoder_cls $GO_ENCODER_CLS \
--protein_encoder_cls $PROTEIN_ENCODER_CLS \
--ke_embedding_size $KE_EMBEDDING_SIZE \
--double_entity_embedding_size $DOUBLE_ENTITY_EMBEDDING_SIZE \
--max_steps $MAX_STEPS \
--per_device_train_batch_size $BATCH_SIZE \
--weight_decay $WEIGHT_DECAY \
--optimize_memory $OPTIMIZE_MEMORY \
--gradient_accumulation_steps $ACCUMULATION_STEPS \
--lr_scheduler_type $SCHEDULER_TYPE \
--mlm_lambda $MLM_LAMBDA \
--lm_learning_rate $MLM_LEARNING_RATE \
--lm_warmup_ratio $LM_WARMUP_RATIO \
--ke_warmup_ratio $KE_WARMUP_RATIO \
--ke_lambda $KE_LAMBDA \
--ke_learning_rate $KE_LEARNING_RATE \
--ke_max_score $KE_MAX_SCORE \
--ke_score_fn $KE_SCORE_FN \
--ke_warmup_ratio $KE_WARMUP_RATIO \
--seed 2021 \
--fp16 \
--dataloader_pin_memory
we get the following error:
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 486 with name protein_lm.bert.encoder.layer.29.output.LayerNorm.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.RuntimeError
This error seems to appear only with DDP training. We also tried DeepSpeed, but then an error occurs while loading the optimizer at line 168 in trainer.py.
How can we solve this? Thank you very much!
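Hello, could you share your pretrained GO embeddings?
Hello, I am trying to run the pretraining code, but I found that the ProteinKG25 release you provide does not contain the swiss_seq folder or the mdb data files under it, and these data are needed later in pretraining. Can this data be generated from ProteinKG25, or only from the raw data?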
Regarding your KE loss function, could you kindly provide some intuitions on why this specific loss function was chosen (given there are so many metric learning losses on KG)? A few relevant pieces of literature that you referenced would be appreciated.
Hello Researchers,
I am running into problems installing deepspeed version 0.5.1. I have already installed python 3.8.13, pytorch=1.12.0 with torchvision=0.13.0, torchaudio=0.12.0, cudatoolkit=11.3.1, transformers=4.9.2, and lmdb=1.3.0. But when I install deepspeed=0.5.1, its dependencies are not installed correctly. Can you please tell me the exact versions of pytorch, python, and deepspeed that you used?
Below is the error I get:
Traceback (most recent call last):
File "", line 1, in
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/deepspeed/__init__.py", line 15, in <module>
from .runtime.engine import DeepSpeedEngine, DeepSpeedOptimizerCallable, DeepSpeedSchedulerCallable
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 20, in <module>
from tensorboardX import SummaryWriter
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/__init__.py", line 5, in <module>
from .torchvis import TorchVis
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/torchvis.py", line 11, in <module>
from .writer import SummaryWriter
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/writer.py", line 15, in <module>
from .event_file_writer import EventFileWriter
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 28, in <module>
from .proto import event_pb2
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/event_pb2.py", line 15, in <module>
from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/summary_pb2.py", line 15, in <module>
from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/tensor_pb2.py", line 15, in <module>
from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
File "//mnt/user1/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/resource_handle_pb2.py", line 35, in <module>
_descriptor.FieldDescriptor(
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 560, in __new__
_message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
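The pasted message is cut off before the workarounds. The full protobuf message normally lists two: downgrading the protobuf package to 3.20.x or lower, or setting the environment variable PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python, which switches to slower pure-Python parsing. A sketch of the second option follows; the variable has to be set before anything imports protobuf, and the function name is hypothetical.

```python
import os


def use_pure_python_protobuf() -> str:
    """Select protobuf's pure-Python implementation for this process."""
    os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
    return os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"]


print(use_pure_python_protobuf())
# Imports that pull in tensorboardX or protobuf (such as deepspeed) would go
# after this call, so they see the variable when they are loaded.
```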
The file 'protein_go_train_triplet.txt' exists in ProteinKG25 , but the code requires 'protein_go_train_triplet_filtered.txt'
Can you recommend the computational resources needed to run one of the downstream tasks, such as run_contact, including fine-tuning the model (i.e. do_train = True)? For example, the suggested number of cores, the amount of memory, and the expected running time. Which resources were used in your experiments, and how long did they take?
I am trying to run the protein contact prediction task on 16 cores and 120GB of memory, with an estimated week needed to get the results. However, the process keeps getting killed because of insufficient memory.
Hi,
I am really interested in your work on OntoProtein. Can your code run on Google Colab? A Google Colab tutorial for running the code would be helpful.
Hello, I want to use OntoProtein to compute protein embeddings. I downloaded the model from https://huggingface.co/zjukg/OntoProtein/tree/main and saved it locally, but the embeddings computed for different proteins are identical. Is this normal?
The four files are: config.json pytorch_model.bin tokenizer_config.json vocab.txt
import argparse
import logging
import os
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
import torch
import yaml
from torch.utils.data.dataloader import DataLoader
from tqdm import tqdm
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModel,
)

tqdm.pandas()
logger = logging.getLogger(__name__)
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
model_name_or_path = '/data/wenyuhao/55/model/ontology'
config = AutoConfig.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
model = AutoModel.from_pretrained(model_name_or_path, config=config).to(device)

def getArray(seq):
    # Encode a single sequence (batch size 1) and return the pooled output.
    input_ids = torch.tensor(tokenizer.encode(seq)).unsqueeze(0).to(device)
    with torch.no_grad():
        outputs = model(input_ids)
    return outputs[1].cpu().numpy()
In [14]: a = getArray('VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS')
In [15]: b = getArray('YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF')
In [16]: a
Out[16]:
array([[-0.11852779, 0.1262154 , -0.11203501, ..., 0.11941278,
0.11056887, -0.12232994]], dtype=float32)
In [17]: b
Out[17]:
array([[-0.11852779, 0.1262154 , -0.11203501, ..., 0.11941278,
0.11056887, -0.12232994]], dtype=float32)
In [18]: Counter(a[0]==b[0])
Out[18]: Counter({True: 1024})
I computed embeddings for all SwissProt proteins and found that they are all identical.
In [29]: s
Out[29]:
array([[-0.11852774, 0.12621534, -0.11203495, ..., 0.11941272,
0.11056883, -0.12232988],
[-0.11852774, 0.12621534, -0.11203495, ..., 0.11941272,
0.11056883, -0.12232988],
[-0.11852774, 0.12621534, -0.11203495, ..., 0.11941272,
0.11056883, -0.12232988],
...,
[-0.11852774, 0.12621534, -0.11203495, ..., 0.11941272,
0.11056883, -0.12232988],
[-0.11852774, 0.12621534, -0.11203495, ..., 0.11941272,
0.11056883, -0.12232988],
[-0.11852774, 0.12621534, -0.11203495, ..., 0.11941272,
0.11056883, -0.12232988]], dtype=float32)
In [30]: s.shape
Out[30]: (20083, 1024)
In [31]: (s==s).all()
Out[31]: True
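One possible cause, offered as an assumption rather than a confirmed diagnosis: if the checkpoint uses a ProtBert-style vocabulary in which each residue is its own token, the tokenizer expects residues separated by spaces. An unspaced string may then be mapped to a single unknown token, so every sequence yields the same input and the same embedding. The sketch below applies the preprocessing described for ProtBert (space-separated residues, with the rare residues U, Z, O, and B mapped to X); the function name is hypothetical.

```python
import re


def to_spaced_residues(sequence: str) -> str:
    """Format a raw amino-acid string as space-separated residue tokens."""
    cleaned = re.sub(r"\s+", "", sequence).upper()  # drop whitespace, normalize case
    cleaned = re.sub(r"[UZOB]", "X", cleaned)       # map rare residues to X
    return " ".join(cleaned)


print(to_spaced_residues("VFYLKMKGDY"))  # V F Y L K M K G D Y
```

Passing to_spaced_residues(seq) to tokenizer.encode and checking that the number of token ids grows with sequence length is a quick way to test this explanation.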
When I tried to run the sample command sh run_main.sh, I got the following error:
Traceback (most recent call last):
File "run_downstream.py", line 8, in <module>
from src.models import model_mapping, load_adam_optimizer_and_scheduler
File "/mnt/SSD2/pmtnet_proj/code/github/OntoProtein/src/models.py", line 16, in <module>
from transformers.deepspeed import is_deepspeed_zero3_enabled
ModuleNotFoundError: No module named 'transformers.deepspeed'
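This error usually means the installed transformers version does not provide the module path the code expects. Pinning transformers to the version the repository was developed with is one option. Another is a fallback import, sketched below with a generic helper; the second module path, transformers.integrations.deepspeed, is an assumption about newer transformers releases and should be verified against the installed version.

```python
import importlib


def import_first(attr: str, module_names):
    """Return `attr` from the first module in `module_names` that provides it."""
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # module not available in this environment, try the next one
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"{attr!r} not found in any of {list(module_names)}")


# Hypothetical usage in src/models.py (requires transformers to be installed):
# is_deepspeed_zero3_enabled = import_first(
#     "is_deepspeed_zero3_enabled",
#     ("transformers.deepspeed", "transformers.integrations.deepspeed"),
# )

# Self-contained demonstration with standard-library modules.
sqrt = import_first("sqrt", ("no_such_module_xyz", "math"))
print(sqrt(9.0))  # 3.0
```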
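Thanks for sharing the code for this very interesting paper! Are there comparison results against ESM-1b on these tasks? Also, are the metrics reported on TAPE all fine-tuned (supervised) results rather than unsupervised results?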
Hi,
I am curious whether it is possible to get the embeddings of proteins and relations in https://www.zjukg.org/project/ProteinKG25/ separately using the Hugging Face OntoProtein model (https://huggingface.co/zjunlp/OntoProtein). Suppose
Protein_1 Enables Protein_2 is a triple in the knowledge graph.
Is it possible to get the embeddings of Protein_1, Enables, and Protein_2 separately using https://huggingface.co/zjunlp/OntoProtein?