Fine-tuning keeps hitting the following assertion errors

AssertionErrors, about patil-suraj/question_generation

yanghoonkim commented on May 30, 2024

I found that setting CUDA_VISIBLE_DEVICES='0', which restricts the run to a single GPU, makes the code work.

The error arises in trainer.py:

    # Our model outputs do not work with DataParallel, so forcing return tuple.
    if isinstance(model, nn.DataParallel):
        inputs["return_tuple"] = True

These lines cause the error when the model runs its forward pass:

    result = self.forward(*input, **kwargs)
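
One possible reading of this report, which the thread does not confirm, is that the extra "return_tuple" key reaches a forward method that rejects keyword arguments it does not know. The toy sketch below only illustrates that mechanism; `forward` and `prepare_inputs` are hypothetical stand-ins written for this example, not the library's code.

```python
def forward(input_ids, labels=None, **kwargs):
    # Toy stand-in for a model forward pass that rejects unknown keyword arguments.
    assert kwargs == {}, f"Unexpected keyword arguments: {list(kwargs.keys())}."
    return ("loss", "logits")


def prepare_inputs(inputs, wrapped_in_data_parallel):
    # Mirrors the quoted trainer.py branch: the flag is added only in the multi-GPU case.
    inputs = dict(inputs)
    if wrapped_in_data_parallel:
        inputs["return_tuple"] = True
    return inputs


batch = {"input_ids": [1, 2, 3], "labels": [4, 5]}

# CPU or single GPU: no extra key is added, so the call succeeds.
single = forward(**prepare_inputs(batch, wrapped_in_data_parallel=False))

# Multi-GPU: the extra key reaches forward and trips the assertion.
try:
    forward(**prepare_inputs(batch, wrapped_in_data_parallel=True))
    multi_gpu_failed = False
except AssertionError as err:
    multi_gpu_failed = True
    message = str(err)
```

Under this assumption, the failure would show up only when the model is wrapped in nn.DataParallel, which would be consistent with the single-GPU and CPU runs in this thread working.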

from question_generation.

judywxy commented on May 30, 2024

When the device is the CPU, this assertion error disappears.

patil-suraj commented on May 30, 2024

Hi @judywxy, what is your transformers version?
It runs fine with version 3.0.0.

Try: pip install -U transformers==3.0.0

judywxy commented on May 30, 2024

@patil-suraj Oh, thanks a lot.
I have the following installed:

tokenizers 0.8.1rc1
torch 1.6.0+cu101
torchfile 0.1.0
torchvision 0.7.0+cu101
tornado 6.0.3
tqdm 4.32.1
traitlets 4.3.2
transformers 3.0.2
typing 3.6.4
urllib3 1.24.2
visdom 0.1.8.8
wandb 0.9.5

So you mean I should change transformers from 3.0.2 to 3.0.0?

patil-suraj commented on May 30, 2024

Yes, I haven't tried it with 3.0.2 yet.

judywxy commented on May 30, 2024

@patil-suraj Thanks for the prompt reply. I will try with 3.0.0.
By the way, I trained a t5-small single-task QG model with the transformers Trainer:
08/24/2020 06:24:24 - INFO - qgtrain - ***** Eval results *****
08/24/2020 06:24:24 - INFO - qgtrain - epoch = 9.999269539810081
08/24/2020 06:24:24 - INFO - qgtrain - eval_loss = 1.6273178581207517
The results look good, as the following three metrics show:
BLEU_4 | METEOR | ROUGE_L
0.189037 | 0.252798 | 0.406141
This is slightly better than the published counterpart model.

So I want to train a multi-task model like t5-multi, and I want to change the following config.
Besides changing train_file_path, valid_file_path, and output_dir, should I also change model_name_or_path from t5-small to t5-base? What about tokenizer_name_or_path?

args = {
    "model_name_or_path": "t5-small",
    "model_type": "t5",
    "tokenizer_name_or_path": "t5_qg_tokenizer",
    "output_dir": "../QG_models03/t5-small-qg-hl",
    "train_file_path": "../QG_data/train_data_qg_highlight_qg_format_t5.pt",
    "valid_file_path": "../QG_data/valid_data_qg_highlight_qg_format_t5.pt",
    "qg_format": "highlight_qg_format",
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 24,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1e-4,
    "num_train_epochs": 12,
    "no_cuda": True,
    "seed": 1,  # Default 42
    "do_train": True,
    "do_eval": True,
    "evaluate_during_training": True,
    "logging_steps": 100,
}
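
A minimal sketch of deriving a second config from the one above by overriding only the keys that differ. The angle-bracket values are hypothetical placeholders, not real paths, and `with_overrides` is a helper written for this example. Whether model_name_or_path or tokenizer_name_or_path should also change is not answered in this thread, so the sketch leaves them untouched.

```python
import json

# Subset of the config above, copied as-is.
base_args = {
    "model_name_or_path": "t5-small",
    "tokenizer_name_or_path": "t5_qg_tokenizer",
    "output_dir": "../QG_models03/t5-small-qg-hl",
    "train_file_path": "../QG_data/train_data_qg_highlight_qg_format_t5.pt",
    "valid_file_path": "../QG_data/valid_data_qg_highlight_qg_format_t5.pt",
}

# Hypothetical placeholders: substitute the real multi-task paths.
overrides = {
    "output_dir": "<multi-task output dir>",
    "train_file_path": "<multi-task train file>",
    "valid_file_path": "<multi-task valid file>",
}


def with_overrides(base, changes):
    # Reject misspelled keys instead of silently adding new ones.
    unknown = set(changes) - set(base)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return {**base, **changes}


multi_args = with_overrides(base_args, overrides)
print(json.dumps(multi_args, indent=2))
```

Keeping the base dict unchanged makes it easy to compare the single-task and multi-task runs key by key.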

judywxy commented on May 30, 2024

@patil-suraj
After changing transformers from 3.0.2 to 3.0.0 and setting "no_cuda": False, the assertion errors appear again!

tokenizers 0.8.0rc4
torch 1.6.0+cu101
torchfile 0.1.0
torchvision 0.7.0+cu101
tornado 6.0.3
tqdm 4.32.1
traitlets 4.3.2
transformers 3.0.0
typing 3.6.4
urllib3 1.24.2
visdom 0.1.8.8
wandb 0.9.5

judywxy commented on May 30, 2024

@patil-suraj

The assertion error is related to the following code in the Trainer class in trainer.py:

    # Our model outputs do not work with DataParallel, so forcing return tuple.
    if isinstance(model, nn.DataParallel):
        inputs["return_tuple"] = True

patil-suraj commented on May 30, 2024

Hey @judywxy,
could you check again whether your version is correct? That change was added to Trainer after 3.0.0.

You can see the Trainer at v3.0.0 here
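
A package listing can come from a different environment than the interpreter that runs the training script, so one way to double-check is to ask that interpreter directly. The sketch below assumes Python 3.8 or newer (for importlib.metadata); `parse_version`, `installed_version`, and `matches` are helpers written for this example.

```python
from importlib import metadata


def parse_version(text):
    # Keep only the leading numeric components, e.g. "1.6.0+cu101" -> (1, 6, 0).
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package):
    # Returns None when the package is not installed for this interpreter.
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches(installed, expected):
    return installed is not None and parse_version(installed) == parse_version(expected)


found = installed_version("transformers")
print("transformers:", found, "| matches 3.0.0:", matches(found, "3.0.0"))
```

Running this with the same Python executable that launches training shows which version that process actually sees.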

judywxy commented on May 30, 2024

@patil-suraj
Here is what is installed. transformers is at version 3.0.0:
tokenizers 0.8.0rc4
torch 1.6.0+cu101
torchfile 0.1.0
torchvision 0.7.0+cu101
tornado 6.0.3
tqdm 4.32.1
traitlets 4.3.2
transformers 3.0.0
typing 3.6.4
urllib3 1.24.2
visdom 0.1.8.8
wandb 0.9.5

varshith321 commented on May 30, 2024

Is this issue resolved? I am still facing issues while training.

daisylab commented on May 30, 2024

I've got the same problem recently. My workaround was using only one GPU.

CUDA_VISIBLE_DEVICES=0 python run_qg.py \
    --model_name_or_path t5-small \
    --model_type t5 \
    --tokenizer_name_or_path t5_qg_tokenizer \
    --output_dir t5-small-qg-hl \
    --train_file_path data/train_data_qg_hl_t5.pt \
    --valid_file_path data/valid_data_qg_hl_t5.pt \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-4 \
    --num_train_epochs 10 \
    --seed 42 \
    --do_train \
    --do_eval \
    --evaluate_during_training \
    --logging_steps 100
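
The same single-GPU restriction can be sketched from inside Python, for setups that launch training from a script or notebook rather than the shell. `pin_single_gpu` and `samples_per_optimizer_step` are hypothetical helpers written for this example. The batch-size formula assumes DataParallel-style scaling, where the per-device batch is multiplied by the number of visible GPUs; treat that as an assumption to verify against the installed Trainer.

```python
import os


def pin_single_gpu(index=0):
    # Must run before torch (or anything else that initializes CUDA) is imported;
    # otherwise the setting may be ignored by the already-initialized runtime.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


def samples_per_optimizer_step(per_device_batch, accumulation_steps, n_gpu):
    # Assumption: per-device batch scales with the number of visible GPUs.
    return per_device_batch * max(1, n_gpu) * accumulation_steps


pin_single_gpu(0)
# With the flags from the command above on one GPU: 32 * 1 * 8.
print(samples_per_optimizer_step(32, 8, n_gpu=1))  # 256
```

If the scaling assumption holds, dropping from several GPUs to one also shrinks the number of samples per optimizer step, which may be worth compensating for with gradient_accumulation_steps.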
