xlang-ai / unifiedskg

[EMNLP 2022] Unifying and multi-tasking structured knowledge grounding with language models

Home Page: https://arxiv.org/abs/2201.05966

License: Apache License 2.0

Dockerfile 0.16% Python 99.84%

natural-language-processing pytorch huggingface-transformers huggingface-datasets structured-knowledge-grounding semantic-parsing question-answering data-to-text text-generation fact-verification

unifiedskg's Issues

prefix tuning with t5-3b

I am trying to run prefix tuning with t5-3b, but I get a strange error:

  File "/home/ubuntu/anaconda3/envs/py3.7pytorch1.8new/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/code/UnifiedSKG/models/prompt/modeling_t5.py", line 486, in forward
    key_states = torch.cat([prefix["prev_key"], key_states], dim=2)
RuntimeError: Sizes of tensors must match except in dimension 3. Got 128 and 32 (The offending index is 0)

This error does not occur with t5-base or t5-large, only with t5-3b. Any tips?
I am also hitting an OOM issue with the t5-3b model: it crashes even with a mini-batch size of 1 on a 40GB GPU. Does anyone have the same issue? Thanks.
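
One possible reading of the shape error, offered as a sketch rather than a confirmed diagnosis: in the published T5 configurations, the per-head dimension `d_kv` equals `d_model / num_heads` for t5-base and t5-large but not for t5-3b. If the prefix is shaped using `d_model // num_heads` (an assumption about the prefix code, not something verified here), it would carry 32 features per head while the attention keys carry 128, which matches the numbers in the message. The helper name `head_dim_mismatches` is hypothetical.

```python
# Published T5 hyperparameters per checkpoint.
T5_CONFIGS = {
    "t5-base": {"d_model": 768, "num_heads": 12, "d_kv": 64},
    "t5-large": {"d_model": 1024, "num_heads": 16, "d_kv": 64},
    "t5-3b": {"d_model": 1024, "num_heads": 32, "d_kv": 128},
}


def head_dim_mismatches(configs):
    """Return {name: (assumed_prefix_dim, actual_key_dim)} where the two differ."""
    mismatches = {}
    for name, cfg in configs.items():
        # Assumption: the prefix head size is derived as d_model // num_heads.
        assumed = cfg["d_model"] // cfg["num_heads"]
        if assumed != cfg["d_kv"]:
            mismatches[name] = (assumed, cfg["d_kv"])
    return mismatches


MISMATCHES = head_dim_mismatches(T5_CONFIGS)
print(MISMATCHES)
```

If this is the cause, shaping the prefix with `d_kv` from the model config instead of the derived value would be the natural thing to try.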

Knowledge Graph as Input: question-specific subgraphs

Hi, very exciting work!

I have a question on how you create the question-specific subgraphs when using Knowledge Graphs as input (i.e., ComplexWebQ). By navigating in compwebq/test.jsonl, I see that the maximum number of triplets used over all questions is 61 and that at least one answer lies within the subgraphs in 2725/2816 (96.8%) test questions.

Do you use specific mechanisms to prune irrelevant facts, and how do you make sure the subgraphs contain the answers?

Thanks a lot!
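
For anyone who wants to recompute the statistics quoted above, here is a minimal sketch. The field names `kg_tuples` and `answers` and the (head, relation, tail) triple layout are assumptions for illustration and may not match the actual schema of compwebq/test.jsonl; answers are matched against entity strings exactly, which may also differ from how the data stores them.

```python
import json


def subgraph_stats(jsonl_lines, triples_key="kg_tuples", answers_key="answers"):
    """Return (max_triples, questions_with_answer_in_subgraph, total_questions)."""
    max_triples = 0
    covered = 0
    total = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        example = json.loads(line)
        total += 1
        triples = example[triples_key]
        max_triples = max(max_triples, len(triples))
        # Collect every head and tail entity mentioned in the subgraph.
        entities = {t[0] for t in triples} | {t[2] for t in triples}
        if any(answer in entities for answer in example[answers_key]):
            covered += 1
    return max_triples, covered, total


# Toy records standing in for lines of the real file.
SAMPLE = [
    json.dumps({"kg_tuples": [["a", "r1", "b"], ["b", "r2", "c"]], "answers": ["c"]}),
    json.dumps({"kg_tuples": [["x", "r1", "y"]], "answers": ["z"]}),
]
STATS = subgraph_stats(SAMPLE)
print(STATS)
```

On the real file one would pass an open file handle instead of `SAMPLE`; the quoted coverage corresponds to `2725 / 2816`.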

[Deprecated] Separate setting: OverflowError: cannot fit 'int' into an index-sized integer

In the fetaqa config file, if I change concatenate to separate and run prefix tuning, the following error occurs:

file train.py", line 185, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/transformers/trainer.py", line 1260, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/pouramini/UnifiedSKG/utils/dataset.py", line 116, in __getitem__
    max_length=self.tokenizer.model_max_length,
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2406, in __call__
    **kwargs,
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2476, in encode_plus
    **kwargs,
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 480, in _encode_plus
    verbose=verbose,
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2913, in prepare_for_model
    return_attention_mask=return_attention_mask,
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2731, in pad
    return_attention_mask=return_attention_mask,
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 3065, in _pad
    encoded_inputs["attention_mask"] = [1] * len(required_input) + [0] * difference
OverflowError: cannot fit 'int' into an index-sized integer
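
The last frame builds `[0] * difference`, and Python raises exactly this error when a list repeat count exceeds the platform's index size. That is consistent with `tokenizer.model_max_length` holding a very large placeholder value, which Hugging Face tokenizers use when no maximum length is configured. The sketch below reproduces the failure in plain Python and shows one way to clamp the value; `safe_max_length`, the `SENTINEL` stand-in, and the fallback are assumptions for illustration, not the repository's fix.

```python
import sys


def pad_mask(length, max_length):
    """Mimic the padding step in the traceback: 1s for tokens, 0s for padding."""
    difference = max_length - length
    return [1] * length + [0] * difference


def safe_max_length(model_max_length, fallback=1024):
    """Replace an unusable (placeholder-sized) max length with a real limit."""
    # A length beyond the platform index size cannot be a real padding target.
    if model_max_length > sys.maxsize:
        return fallback
    return model_max_length


SENTINEL = int(1e30)  # assumed stand-in for an unconfigured model_max_length

try:
    pad_mask(5, SENTINEL)
    RAISED = False
except OverflowError:
    RAISED = True

FIXED_MASK = pad_mask(5, safe_max_length(SENTINEL, fallback=8))
print(RAISED, FIXED_MASK)
```

In the real code the equivalent step would be to pass an explicit, finite `max_length` to the tokenizer call at `utils/dataset.py` line 116.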

Help with reproducing T5-3b number on Spider

Hi,

I'm trying to reproduce the Table 2 ST number with T5-3B on Spider.
I'm using the following command on 16 A100 GPUs:

deepspeed train.py --deepspeed deepspeed/ds_config_zero2.json --seed 2 --cfg Salesforce/T5_3b_finetune_spider_with_cell_value.cfg --run_name T5_3b_finetune_spider --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 250 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 250 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 80 --adafactor false --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_3b_finetune_spider --overwrite_output_dir --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true

Does it look right? I get 68.83 using this command. Could you help me with the command that can reproduce 71.76 on Spider? Thanks!
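
When comparing a run against reported numbers, one thing worth checking is the effective batch size, since it depends on the GPU count as well as the flags. A small sketch of the arithmetic for the command above (the function name is hypothetical, and the batch size used for the paper's numbers is not stated in this thread):

```python
def effective_batch_size(num_gpus, per_device_batch, grad_accum_steps):
    """Number of examples contributing to each optimizer step."""
    return num_gpus * per_device_batch * grad_accum_steps


# 16 GPUs, --per_device_train_batch_size 1, --gradient_accumulation_steps 8
SPIDER_3B = effective_batch_size(16, 1, 8)
print(SPIDER_3B)
```

Running the same flags on a different number of GPUs changes this value unless `--gradient_accumulation_steps` is adjusted to compensate.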

NQ-Tables test set evaluation

Hi,

What is the best SKG-trained model to evaluate the natural questions tables (NQ-Tables) test set? Is it "from_all_T5_large_prefix_wikitq2"?

Thanks,

Yasser

Model checkpoints and outputs for T5-3b

Hi, thank you for the cool work!

I am a student at the University of Glasgow looking to reproduce your results for tabular question answering.
The largest checkpoints available for download are T5-large, but for the best analysis we need T5-3B.

Could you please upload these to the Hugging Face model hub?

Alternatively, could you provide the model predictions on the test and dev sets for all datasets and model variants? Then the models themselves would not need to be run.

Thanks!

How can I specify PLM folder

I have problems downloading or caching the PLM because of my connection and blocked websites.

I want to point to a folder containing a PLM that was already downloaded. How can I set it in the YAML?
Thanks
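
As general background, the Hugging Face `from_pretrained` methods accept a local directory path in place of a hub model name, so a folder path can usually go wherever the model name is configured. Which config field holds that name in this repository is not stated here. The sketch below is a hypothetical pre-flight check (`resolve_plm` is not part of the repository) that confirms a folder looks like a downloaded model before it is used.

```python
import os
import tempfile

REQUIRED_FILES = ("config.json",)


def resolve_plm(name_or_path):
    """Return ('local', abs_path) for a usable folder, else ('hub', name)."""
    if os.path.isdir(name_or_path):
        missing = [
            f for f in REQUIRED_FILES
            if not os.path.isfile(os.path.join(name_or_path, f))
        ]
        if missing:
            raise FileNotFoundError(f"{name_or_path} is missing: {missing}")
        return "local", os.path.abspath(name_or_path)
    # Not a directory: treat it as a hub identifier to be downloaded.
    return "hub", name_or_path


# Demo with a temporary folder standing in for a downloaded model.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "config.json"), "w").close()
    DEMO_LOCAL = resolve_plm(demo_dir)[0]

DEMO_HUB = resolve_plm("some-org/not-a-local-folder")
print(DEMO_LOCAL, DEMO_HUB)
```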

huggingface model download failed

Hi, thanks for sharing this exciting work!

I'm having trouble downloading the model from Hugging Face.

When I download the tokenizer of hkunlp/from_all_T5_base_prefix_grailqa2, I get an error:

tokenizer = AutoTokenizer.from_pretrained("hkunlp/from_all_T5_base_prefix_grailqa2")

Traceback (most recent call last):
  File "<input>", line 4, in <module>
  File "/home2/xh/.conda/envs/skg/lib/python3.6/site-packages/transformers/models/auto/tokenization_auto.py", line 416, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1705, in from_pretrained
    resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
  File "/home2/xh/.conda/envs/skg/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1776, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.6/site-packages/transformers/models/t5/tokenization_t5_fast.py", line 136, in __init__
    **kwargs,
  File "/home2/xh/.conda/envs/skg/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py", line 87, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: Permission denied (os error 13)

The same problem happens with others such as from_all_T5_base_prefix_compwebq2, but downloading their models works fine.

Looking forward to your reply, thanks.

How can I train the multi-task models?

Hi, thanks for the great project and I am quite interested in it.

I briefly checked the code repo and the training process but I didn't find the right configuration to train the unified (multi-task) model. Any pointers or suggestions for this?

Can I use the project for a text to text task?

I am interested in using the prefix-tuning with T5 implemented in this project, but for a text-to-text (text generation) project. Could you tell me which files I should follow, or how to modify them to add such a task? Or, if I just want to use the prefix-tuning part of the models in my own project, how can I call the model?

(question on paper) knowledge source used for WQSP?

Hi, I have a question about your experimental setting.
(I apologize that it's not a direct coding/engineering issue, but I couldn't find e-mail addresses in your EMNLP paper)

The figure in your paper suggests that Freebase is used as KG, but I couldn't find explicit mention about the exact type of KG used for WQSP dataset.

Could you specify which KG is used for WQSP (for example, Freebase or WikiData)?

the original prefix-tuning corresponds to the models in the directory `models/prompt`?

Dear authors,

Thank you so much for the effort and for open-sourcing the well-maintained and clean codebase! I also appreciate your detailed explanations on Zhihu.

I just have a quick, specific question: are the scripts inside models/prompt a re-implementation of the prefix tuning paper (Li & Liang)?

Thank you so much!

How to use fp16

Hi,
I found the deepspeed directory in the project, and I want to know how to train with fp16. For now I just add --fp16 to the command, but it does not seem to work.

TypeError: add_code_sample_docstrings() got an unexpected keyword argument 'tokenizer_class

When I tried your command for wikiQA t5 prefix, I got the following error:

File "/home/pouramini/UnifiedSKG/models/unified/prefixtuning.py", line 8, in <module>
    from ..prompt.modeling_auto import AutoModelForSeq2SeqLM
  File "/home/pouramini/UnifiedSKG/models/prompt/modeling_auto.py", line 34, in <module>
    from .modeling_bart import (
  File "/home/pouramini/UnifiedSKG/models/prompt/modeling_bart.py", line 1162, in <module>
    class BartModel(BartPretrainedModel):
  File "/home/pouramini/UnifiedSKG/models/prompt/modeling_bart.py", line 1189, in BartModel
    @add_code_sample_docstrings(
TypeError: add_code_sample_docstrings() got an unexpected keyword argument 'tokenizer_class
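
An "unexpected keyword argument" raised from a decorator at import time typically points to a mismatch between the installed `transformers` version and the version the vendored `models/prompt` files were written against. A minimal, generic sketch for checking whether a callable accepts a given keyword follows; the `stand_in` function and its parameter names are hypothetical placeholders, and in practice one would pass the real `add_code_sample_docstrings` from the installed library.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with `name=` as a keyword argument."""
    for param in inspect.signature(func).parameters.values():
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            return True  # a **kwargs parameter accepts any keyword
        if param.name == name and param.kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        ):
            return True
    return False


def stand_in(*docstr, processor_class=None, checkpoint=None):
    """Hypothetical stand-in for a decorator whose keyword was renamed."""
    return None


OLD_OK = accepts_keyword(stand_in, "tokenizer_class")
NEW_OK = accepts_keyword(stand_in, "processor_class")
print(OLD_OK, NEW_OK)
```

If the check fails for the real function, installing the `transformers` version pinned by the project's environment file is usually simpler than patching the call sites.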

Questions about MultiWOZ and SMD (KVRET)

Thank you for your awesome work!

I have two questions about structured knowledge processing on MultiWOZ and SMD (KVRET) datasets:

For the MultiWOZ dataset, what is ontology_values for non-categorical slots (e.g. name, time)?

https://github.com/HKUNLP/UnifiedSKG/blob/65157f72d259c88d14603dd33ce747124e286f33/seq2seq_construction/multiwoz.py#L87-L88

For the SMD (KVRET) dataset, the whole KB (without any explicit / hidden row selection) is fed into the linearized structured knowledge, right?

BLEC metric for SQL-to-Text

Dear Authors,

Thanks for releasing this code.

I have a question regarding the BLEC metric for SQL-to-Text:

Does BLEC utilize a gold reference? Or is it just based on the predicted text and the input SQL?
https://github.com/HKUNLP/UnifiedSKG/blob/0d47cd9382f0a3d17c70701056f399706ee3b698/third_party/BLEC/Spider/spider.py#L22

Thanks

Can not reproduce the results of spider with T5-base

Hi, thanks for sharing the great work.
But I have some problems reproducing the results of T5-base on the Spider dataset with prefix tuning. I use the following command with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_prefix_spider_with_cell_value.cfg --run_name T5_base_prefix_spider --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --gradient_accumulation_steps 8 --num_train_epochs 400 --save_total_limit 1 --adafactor true --learning_rate 5e-5 --load_best_model_at_end --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_prefix_spider --overwrite_output_dir --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --generation_num_beams 1 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true

We get 52.32 for T5-base on Spider and 59.57 for T5-large, which differ from the results reported in the paper. Can you help me with this problem? Thank you.

Regarding tuning the prefixes

@Timothyxxx Hi, I have a few questions about prefix tuning. Could you share your opinion on them, since you have worked extensively on it (the GitHub repo of prefix tuning seems abandoned)?

Will training the whole model along with the prefixes yield better performance (assuming it is not a low-data regime, i.e. more than 1k data points)?
Suppose we train the prefixes in two cases: with the model frozen and with the model unfrozen. Will the learned prefix representations be the same in both cases, or different?
Suppose I train the model and prefixes in two scenarios: (i) train the model and prefixes together (single-stage fine-tuning), and (ii) fine-tune only the model first, freeze it, and then train only the prefixes with the frozen weights (two stages). Is there any difference between the two (I think there is), and if so, why?

[Bug] When running without WandB, the `wandb_run_dir` is None

When running without WandB, self.wandb_run_dir is None, which raises a FileNotFoundError.

https://github.com/HKUNLP/UnifiedSKG/blob/ecba08f7c81b2472c4046286e38e5f9ffe896d84/utils/trainer.py#L343-L353

https://github.com/HKUNLP/UnifiedSKG/blob/ecba08f7c81b2472c4046286e38e5f9ffe896d84/train.py#L133-L145
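
One way to guard against this, sketched under assumptions: fall back to the training output directory whenever the WandB run directory is unset. The helper name `resolve_dump_dir` and the choice of fallback are hypothetical and not taken from the repository.

```python
import os
import tempfile


def resolve_dump_dir(wandb_run_dir, output_dir):
    """Pick a directory for trainer dumps, falling back when WandB is disabled."""
    base = wandb_run_dir if wandb_run_dir is not None else output_dir
    # Create the directory so later file writes do not fail on a missing path.
    os.makedirs(base, exist_ok=True)
    return base


# Demo: WandB disabled (None), so the output directory is used and created.
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "output")
    CHOSEN = os.path.basename(resolve_dump_dir(None, out))
    EXISTS = os.path.isdir(out)

print(CHOSEN, EXISTS)
```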

Hyperparameters for all the datasets

It seems that you only report the detailed hyperparameters for WikiTQ. Where can we find the hyperparameters for other datasets that can reproduce the results?

Result verification - could you please give a pointer to the model used to obtain SPIDER results in the paper?

Is there a unified entrance to reproduce all paper experiments?

like scripts folder:

scripts/train_spider.sh
scripts/train_wikitq.sh
...

Or, for all datasets, is the command to run the experiment exactly the same as in the README Training section?

Help wanted! What is the purpose of adapter tuning in the code? Why does the adapter need to introduce a certain number of virtual tokens?

I found differences between the paper and code of Unified-SKG. The paper does not mention any experiments or results related to adapter tuning, but I discovered adapter tuning code in the codebase. Moreover, this adapter tuning code requires carrying 10 virtual tokens for tuning. Why is that?

Question on Prefix tuning code

Hi, I am looking at the prefix tuning code and have a few questions about the implementation.

What exactly are the variables in these lines? I understand that prefix tuning provides input to every layer of the encoder-decoder model, but my understanding is that there should be a single wte and control_trans; I am not sure what the variables in the highlighted lines do.
I don't understand the *2 in this line of code. Why is it there?
What does the control_trans variable mean in the code? What is its function?
Also, I see another variable, mid_dim. What is it conceptually?

Thank you
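
For orientation, here is a shape-only sketch of the reparameterization in the original prefix-tuning code of Li & Liang, as far as I understand it; it may not match this repository line for line. There, `wte` embeds the prefix positions, `control_trans` is a small MLP that passes each embedding through a hidden width `mid_dim`, and the factor 2 provides one key and one value per layer. An encoder-decoder model attends in several places (encoder self-attention, decoder self-attention, cross-attention), which would explain more than one `wte`/`control_trans` pair. All sizes below are illustrative.

```python
def prefix_shapes(prefix_len, n_layer, n_head, n_embd, mid_dim):
    """Tensor shapes (batch dimension omitted) for one prefix reparameterization."""
    head_dim = n_embd // n_head
    return {
        # One trainable embedding per prefix position.
        "wte": (prefix_len, n_embd),
        # First linear layer of control_trans: embedding -> hidden width.
        "control_trans_in": (n_embd, mid_dim),
        # Second linear layer: hidden width -> a key and a value for every layer.
        "control_trans_out": (mid_dim, n_layer * 2 * n_embd),
        # After reshaping: (layers * 2, heads, prefix length, head size).
        "past_key_values": (n_layer * 2, n_head, prefix_len, head_dim),
    }


SHAPES = prefix_shapes(prefix_len=10, n_layer=12, n_head=12, n_embd=768, mid_dim=512)
for name, shape in SHAPES.items():
    print(name, shape)
```

The MLP exists only to make optimization easier; after training, its outputs can be cached and the MLP itself discarded.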

Checkpoints for models trained on Spider

Dear Authors,

Thanks for putting together this great resource.

I would like to know whether checkpoints have been released for T5 models finetuned on the Spider dataset,
i.e. the checkpoints corresponding to the Spider row in Table 2.

The Spider checkpoints currently released on huggingface correspond only to prefix-tuning (and not finetuning) experiments. Am I right?

Also, what config files should one use to reproduce Spider row in Table 2?

E.g. is T5_large_finetune_spider_with_cell_value.cfg the file to reproduce the numbers for T5 large?

Thank you :)

About prefix tuning

When doing multi-task prefix-tuning, my understanding is that T5 is not finetuned on any SKG task (or not finetuned at all?). Would it be better to finetune it first and then prefix-tune it? I am just curious whether you have run any experiments on this.

Really wonderful work!

Error on colab

Hi,
In the "import essential package" section of the Colab notebook, I got this error:

AttributeError Traceback (most recent call last)
in <cell line: 4>()
2 import time
3 import torch
----> 4 import datasets
5 from transformers import (
6 HfArgumentParser,

6 frames
/usr/lib/python3.10/functools.py in update_wrapper(wrapper, wrapped, assigned, updated)
59 # Issue #17482: set __wrapped__ last so we don't inadvertently copy it
60 # from the wrapped function when updating __dict__
---> 61 wrapper.__wrapped__ = wrapped
62 # Return the wrapper so this can be used as a decorator via partial()
63 return wrapper

AttributeError: readonly attribute

Is this because of the following error, which I got during pip install?
Building wheels for collected packages: tokenizers, sacremoses
error: subprocess-exited-with-error

× Building wheel for tokenizers (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
Building wheel for tokenizers (pyproject.toml) ... error
ERROR: Failed building wheel for tokenizers
Building wheel for sacremoses (setup.py) ... done
Created wheel for sacremoses: filename=sacremoses-0.0.53-py3-none-any.whl size=895241 sha256=da4bd6abf554d38d1aef9a04f43337bfe477ff8dc20ca937e3b0820ca1903c6e
Stored in directory: /root/.cache/pip/wheels/00/24/97/a2ea5324f36bc626e1ea0267f33db6aa80d157ee977e9e42fb
Successfully built sacremoses
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

The different results between eval mode and test mode.

Why do I get different results between eval mode and test mode?

The saved model folder does not contain config.json.

When I try to evaluate the model using my saved checkpoint, the error asks whether "'t5-model' is the correct path to a directory containing a config.json file".
I realized that the saved model folder does not contain config.json.
How can I fix this?

NCCL version

I have installed the environment from the yaml file and installed torch 1.8 following the settings in the README.

My CUDA version is 11.4. It seems to be a version conflict between NCCL, PyTorch and CUDA.

Is my CUDA version too high?

ssh://[email protected]:22/home2/xh/.conda/envs/skg/bin/python -u -m torch.distributed.launch --nproc_per_node 4 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_prefix_compwebq.cfg --run_name T5_base_prefix_compwebq --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 2 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_prefix_compwebq --overwrite_output_dir --per_device_train_batch_size 2 --per_device_eval_batch_size 4 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
INFO:filelock:Lock 140211123887800 acquired on .lock
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
INFO:filelock:Lock 140211123887800 released on .lock
INFO:filelock:Lock 140144150953768 acquired on .lock
INFO:filelock:Lock 140144150953768 released on .lock
INFO:filelock:Lock 139898741587640 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
INFO:filelock:Lock 139898741587640 released on .lock
INFO:filelock:Lock 139711655354096 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
INFO:filelock:Lock 139711655354096 released on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Killing subprocess 30316
Killing subprocess 30320
Killing subprocess 30321
Killing subprocess 30322
Traceback (most recent call last):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home2/xh/.conda/envs/skg/bin/python', '-u', 'train.py', '--local_rank=3', '--seed', '2', '--cfg', 'Salesforce/T5_base_prefix_compwebq.cfg', '--run_name', 'T5_base_prefix_compwebq', '--logging_strategy', 'steps', '--logging_first_step', 'true', '--logging_steps', '4', '--evaluation_strategy', 'steps', '--eval_steps', '500', '--metric_for_best_model', 'avr', '--greater_is_better', 'true', '--save_strategy', 'steps', '--save_steps', '500', '--save_total_limit', '1', '--load_best_model_at_end', '--gradient_accumulation_steps', '2', '--num_train_epochs', '400', '--adafactor', 'true', '--learning_rate', '5e-5', '--do_train', '--do_eval', '--do_predict', '--predict_with_generate', '--output_dir', 'output/T5_base_prefix_compwebq', '--overwrite_output_dir', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '4', '--generation_num_beams', '4', '--generation_max_length', '128', '--input_max_length', '1024', '--ddp_find_unused_parameters', 'true']' returned non-zero exit status 1.

Process finished with exit code 1
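
Separately from the NCCL failure, the log above warns that `torch.set_deterministic` is deprecated in favor of `torch.use_deterministic_algorithms`. A version-tolerant sketch that works with either API is shown below; the helper takes the module as an argument so it can be exercised without PyTorch installed, and the two `Fake*` classes are stand-ins for illustration only.

```python
def enable_determinism(torch_module):
    """Turn on deterministic algorithms with whichever API this build offers."""
    if hasattr(torch_module, "use_deterministic_algorithms"):
        torch_module.use_deterministic_algorithms(True)
        return "use_deterministic_algorithms"
    if hasattr(torch_module, "set_deterministic"):
        torch_module.set_deterministic(True)
        return "set_deterministic"
    raise AttributeError("no determinism switch found on this torch module")


class FakeOldTorch:
    """Stand-in for a build that only has the deprecated function."""
    def __init__(self):
        self.flag = None

    def set_deterministic(self, value):
        self.flag = value


class FakeNewTorch:
    """Stand-in for a build where the deprecated function was removed."""
    def __init__(self):
        self.flag = None

    def use_deterministic_algorithms(self, value):
        self.flag = value


OLD, NEW = FakeOldTorch(), FakeNewTorch()
USED_OLD = enable_determinism(OLD)
USED_NEW = enable_determinism(NEW)
print(USED_OLD, USED_NEW)
```

With a real installation, the call would be `enable_determinism(torch)` after `import torch`, in place of the direct `torch.set_deterministic(True)` call.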

I also tried torch 1.11+cu113 and got another error:


(skg) xh@4210GPU:~/PycharmProject/UnifiedSKG$ python -m torch.distributed.launch --nproc_per_node 4 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_finetune_compwebq.cfg --run_name T5_base_finetune_compwebq --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 2 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_finetune_compwebq --overwrite_output_dir --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 17913) of binary: /home2/xh/.conda/envs/skg/bin/python
Traceback (most recent call last):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 17914)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 17915)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 17916)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 17913)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Looking forward to your reply.
Thank you.
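
The AttributeError in the second log is raised by `torch.set_deterministic(True)` at line 29 of train.py. That function was deprecated in PyTorch 1.8 in favour of `torch.use_deterministic_algorithms` and is no longer present in the 1.11 build used here. A minimal sketch of a version-tolerant call follows; the helper name `enable_determinism` is hypothetical and not part of the repository, and a stand-in object replaces the real `torch` module so the snippet runs without PyTorch installed.

```python
from types import SimpleNamespace


def enable_determinism(torch_module):
    """Enable deterministic algorithms via whichever API this torch build has.

    Hypothetical helper, not part of UnifiedSKG.
    """
    if hasattr(torch_module, "use_deterministic_algorithms"):
        # Name available from PyTorch 1.8 onward.
        torch_module.use_deterministic_algorithms(True)
        return "use_deterministic_algorithms"
    if hasattr(torch_module, "set_deterministic"):
        # Older name, which is what train.py line 29 calls.
        torch_module.set_deterministic(True)
        return "set_deterministic"
    raise AttributeError("this torch build exposes neither determinism API")


# Stand-in for the torch module (assumption for illustration only);
# in train.py you would pass the real `torch` module instead.
calls = []
fake_new_torch = SimpleNamespace(use_deterministic_algorithms=calls.append)
print(enable_determinism(fake_new_torch))
```

Note that `torch.use_deterministic_algorithms(True)` can raise a RuntimeError for operations that have no deterministic implementation, so further adjustments may be needed after this change.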

GPU and batch size setting

I am using the training command from the README:
python -m torch.distributed.launch --nproc_per_node 4 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_finetune_wikitq.cfg --run_name T5_base_finetune_wikitq --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_finetune_wikitq --overwrite_output_dir --per_device_train_batch_size 4 --per_device_eval_batch_size 16 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true
My question is how to set the number of GPUs and the batch size. The README says this command uses 4 GPUs and a batch size of 128, but I don't see either value in the command or in the code.
Thanks.
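
Both numbers can be read off the command itself. `--nproc_per_node 4` starts one process per GPU, and under distributed training with gradient accumulation the effective batch size is the product of the process count, the per-device batch size, and the accumulation steps. The short sketch below does that arithmetic with the values from the command above; the function name is only illustrative.

```python
def effective_batch_size(num_processes, per_device_batch_size, grad_accum_steps):
    """Samples contributing to one optimizer step across all processes."""
    return num_processes * per_device_batch_size * grad_accum_steps


# Values taken from the README command:
#   --nproc_per_node 4
#   --per_device_train_batch_size 4
#   --gradient_accumulation_steps 8
print(effective_batch_size(4, 4, 8))  # 128
```

To keep a batch size of 128 on fewer GPUs, one option is to lower `--nproc_per_node` and raise `--gradient_accumulation_steps` by the same factor, for example 1 process with 32 accumulation steps.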

WikiSQL model producing only one token as output

I tried to follow the Colab demo and loaded the WikiSQL tokenizer and model. When I ran it, the model produced only a single token as output, and I have no idea what's wrong.

[Deprecated] Size mismatch when setting knowledge_usage to separate

For the evaluation, I receive the following error:

...
  File "/home/pouramini/UnifiedSKG/utils/trainer.py", line 298, in prediction_step
    **gen_kwargs,
  File "/home/pouramini/UnifiedSKG/models/unified/prefixtuning.py", line 289, in generate
    bsz=bsz, sample_size=kwargs['num_beams'], description=description_representation, knowledge=knowledge_representation,
  File "/home/pouramini/UnifiedSKG/models/unified/prefixtuning.py", line 120, in get_prompt
    past_key_values = torch.cat([past_key_values, self.knowledge_trans(knowledge)], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 0. Got 4 and 16 (The offending index is 0)

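It seems the mismatch occurs when num_beams is greater than 1: it is passed as sample_size to the get_prompt method and multiplied by the batch size, while the knowledge tensor in the code below is concatenated without being expanded in the same way.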

    def get_prompt(self, bsz=None, sample_size=1, description=None, knowledge=None):
        old_bsz = bsz
        bsz = bsz * sample_size
        input_tokens = self.input_tokens.unsqueeze(0).expand(bsz, -1)
        temp_control = self.wte(input_tokens)
        if description is not None:
            temp_control = temp_control + description.repeat_interleave(sample_size, dim=0).unsqueeze(1)
        past_key_values = self.control_trans(temp_control)  # bsz, seqlen, layer*emb
        if knowledge is not None:
            past_key_values = torch.cat([past_key_values, self.knowledge_trans(knowledge)], dim=1)
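
The shape bookkeeping behind this error can be sketched without PyTorch. The helper below mimics the rule that concatenation requires all sizes to match except along the concatenation dimension. The concrete numbers (batch size 4, 4 beams, prefix length 10, knowledge length 5, hidden size 8) are assumptions chosen to be consistent with the "4 and 16" in the error message, not values taken from the repository.

```python
def cat_shape(shapes, dim):
    """Result shape of concatenating along `dim`; raises on a mismatch elsewhere."""
    first = shapes[0]
    total = 0
    for shape in shapes:
        for axis, (a, b) in enumerate(zip(first, shape)):
            if axis != dim and a != b:
                raise ValueError(
                    f"sizes must match except in dimension {dim}: "
                    f"got {a} and {b} in dimension {axis}"
                )
        total += shape[dim]
    return first[:dim] + (total,) + first[dim + 1:]


bsz, sample_size = 4, 4                      # assumed eval batch size and num_beams
prefix = (bsz * sample_size, 10, 8)          # after expand(bsz * sample_size, -1)
knowledge = (bsz, 5, 8)                      # still at the original batch size

try:
    cat_shape([prefix, knowledge], dim=1)
except ValueError as err:
    print("mismatch:", err)

# Repeating each knowledge row sample_size times aligns the batch dimension.
knowledge_expanded = (bsz * sample_size,) + knowledge[1:]
print(cat_shape([prefix, knowledge_expanded], dim=1))
```

In PyTorch, the corresponding expansion would be something like `knowledge.repeat_interleave(sample_size, dim=0)` before the concatenation, mirroring what the quoted code already does for `description`. Whether that is the right fix for this code path is untested here.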

About training time

Hi, first of all, thanks for your brilliant work! I am trying to add a new task to UnifiedSKG (similar to kvret, but much bigger) using prefix-tuning with t5-base. However, training takes a very long time: on a single 3090 GPU with bsz=4, one epoch takes about 2.5 hours. Could you give me any advice? Table 18 of the UnifiedSKG paper says "Prefix-tuning needs more steps to converge and converges to comparable performances". Does this mean that prefix-tuning needs more training time to reach the same performance as fine-tuning?
Thanks a lot! Lucy

Getting the multi-task prefix weights

I downloaded the hkunlp/from_all_T5_large_prefix_sql2text2 model but did not find the prefix weights used inside the model.

Where can I get the prefix weights obtained from multi-task prefix tuning and also the finetuned prefix weights on individual tasks?

TypeError: get_hash() missing 1 required positional argument: 'use_fast_tokenizer'

I generated some predictions for the fetaqa task. Then I tried to run its evaluator on the output results (predictions_predict.json). I get the following error for the bertscore metric.

File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/datasets/metric.py", line 402, in compute
    output = self._compute(predictions=predictions, references=references, **kwargs)
  File "/home/pouramini/.cache/huggingface/modules/datasets_modules/metrics/bertscore/ad3c901af4d4f2c58af08db5c6aae5af2a3668edcf7c8f3fa9419c22807ad482/bertscore.py", line 147, in _compute
    use_custom_baseline=baseline_path is not None,
TypeError: get_hash() missing 1 required positional argument: 'use_fast_tokenizer'

There could be a version mismatch between the datasets metrics and bertscore; however, I installed the versions listed in the yaml file.

CUDA OOM with DeepSpeed stage 3 for 11B T5 model

Have you tried training the T5 11B-parameter model on a single node with 8 GPUs for single-task full fine-tuning? I have not been able to get past the CUDA OOM issue with this codebase, even with the per-device batch size set to 1 for both training and eval, on a p4d.24xlarge machine with 8 GPUs.
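
A rough back-of-envelope estimate shows why memory is tight in this setting. The sketch assumes the common rule of thumb of about 16 bytes per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 master weights, momentum, and variance). The issue does not state the optimizer or precision used, so these are assumptions, and activations, buffers, and fragmentation are ignored.

```python
def model_state_gb(num_params, bytes_per_param=16, num_gpus=1):
    """Approximate per-GPU memory (in GB) for weights, gradients and optimizer
    state, assuming they are sharded evenly across num_gpus."""
    return num_params * bytes_per_param / num_gpus / 1e9


params = 11e9  # roughly 11B parameters (assumed round figure)

print(model_state_gb(params))               # 176.0 GB for the unsharded model states
print(model_state_gb(params, num_gpus=8))   # 22.0 GB per GPU if fully sharded
```

Under these assumptions, fully sharded model states alone take about 22 GB of each 40 GB A100 on a p4d.24xlarge, which leaves limited room for activations at an input length of 1024 and for the temporary buffers used when gathering parameters.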

DDP training default

Does train.py use DDP training by default, even if n_gpus = 1?

Help wanted! Confused about prefix_spider_with_cell_value.cfg in the T5-base and T5-large configurations.

In UnifiedSKG/configure/Salesforce/T5_base_prefix_spider_with_cell_value.cfg,
use_description=True and concatenate_description=True are set.

But in UnifiedSKG/configure/Salesforce/T5_large_prefix_spider_with_cell_value.cfg,
use_description=False and concatenate_description=False are set.

Why do the settings differ between model sizes when the tuning method is the same? Is the "t5-large" configuration written incorrectly?

Hi,
On the colab import essential package section, I got this error: