Comments (7)
{"max_len_a": 1.2, "max_len_b": 10} means that max_len will differ in different GPUs(please see https://github.com/facebookresearch/fairseq/blob/main/fairseq/sequence_generator.py#L335-L576).
so the alltoall of the largest length is always pending other processes.
One solution is that using {"max_len": 20} instead.
But I don't really understand the effect of this change on the BLEU score.
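For concreteness, a minimal sketch of that budget computation (following the linked sequence_generator.py lines; the function name and the sample source lengths are illustrative):

```python
# Minimal sketch of fairseq's per-batch decoding budget
# (see the linked sequence_generator.py lines); the function
# name and the sample source lengths are illustrative.
def decode_budget(src_len, max_len_a=1.2, max_len_b=10):
    # The budget depends on the local batch's source length,
    # which differs across data-parallel GPUs.
    return int(max_len_a * src_len + max_len_b)

print(decode_budget(20))  # GPU 0: up to 34 decoding steps
print(decode_budget(10))  # GPU 1: up to 22 decoding steps
# Each decoding step runs the MoE layer once (one all-to-all),
# so GPU 0's later all-to-alls wait on calls GPU 1 never makes.
```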
Thanks for your information. This is a duplicate of #173.
We'll update the fairseq patch to add inequivalent_tokens=True, which was recently added to tutel but is not yet in the fairseq patch. You may apply it yourself temporarily.
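Until the patch is updated, a hedged sketch of what applying it yourself might look like at the MoE call site. The layer construction here follows tutel's README style but is only illustrative, and it assumes a recent tutel where the layer's forward accepts this keyword:

```python
# Hedged sketch: enabling inequivalent_tokens at the MoE call site.
# Assumes a distributed context and a recent tutel whose forward
# accepts this keyword; construction arguments are illustrative.
import torch.nn.functional as F
from tutel import moe as tutel_moe

moe = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=512,
    experts={'type': 'ffn', 'count_per_node': 2,
             'hidden_size_per_expert': 1024,
             'activation_fn': lambda x: F.relu(x)},
)

def forward_moe(x):
    # Tell tutel that ranks may contribute different token counts,
    # so the dispatch/all-to-all is sized consistently across ranks.
    return moe(x, inequivalent_tokens=True)
```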
If I understand correctly, I respectfully disagree with this view. In #173, different devices hold inequivalent numbers of tokens; in this case, however, some devices run no forward pass at all because max_len differs between them (https://github.com/facebookresearch/fairseq/blob/main/fairseq/sequence_generator.py#L335).
That's interesting. If it's true that one GPU performs 5 forwards while another performs 6, does traditional data parallelism even work? I think the application itself has to do something special to avoid that, right?
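A hypothetical minimal reproduction of the failure mode under discussion: if ranks post different numbers of collective calls, the extra call blocks forever. The setup below is illustrative and not from the thread:

```python
# Hypothetical minimal reproduction: ranks posting different numbers
# of all-to-alls deadlock on the unmatched call. Run with e.g.:
#   torchrun --nproc_per_node=2 repro.py
import torch
import torch.distributed as dist

dist.init_process_group('nccl')
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)

steps = 6 if rank == 0 else 5   # "5 forwards vs. 6 forwards"
buf = torch.zeros(world, device='cuda')
for _ in range(steps):
    out = torch.empty_like(buf)
    dist.all_to_all_single(out, buf)  # rank 0's 6th call never returns
```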
The number of forward passes may differ across GPUs during validation; in some codebases, validation forwards even run only on GPU 0.
In this code, two sentences of different lengths (e.g. "How are you" and "Thank you") are distributed to two GPUs, and the BLEU validation decodes the translations word by word. So at the first step "How" and "Thank" are processed in data parallel, and at the second step "are" and "you" are as well; but at the third step there is no word left on the other GPU to pair with the "you" in "How are you", so that step's MoE all-to-all has no partner.
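A toy walk-through of that pairing (hypothetical, just printing which rank joins each step's all-to-all):

```python
# Toy walk-through (hypothetical) of the example above: one MoE
# all-to-all per generated word, two ranks decoding in lockstep.
outputs = {0: ["How", "are", "you"], 1: ["Thank", "you"]}
for step in range(max(len(words) for words in outputs.values())):
    for rank, words in outputs.items():
        if step < len(words):
            print(f"step {step}: rank {rank} generates {words[step]!r} and joins the all-to-all")
        else:
            print(f"step {step}: rank {rank} is done and posts nothing -> the other rank hangs")
```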
OK, this root cause makes sense. The MoE layer within one process does not know the application's intent, i.e. whether the other processes are going to run the MoE forward together with it or not. Even if it knew that some processes were suspended and would not join the validation procedure, the expert parameters stored on those processes would still be inaccessible to the validating process.
So it seems there is no solution except making changes on the application side, which is to run the evaluation in every process. Can you point to the code lines that perform the related validation procedure?
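A hedged sketch of that application-side fix: make every rank agree on a shared decoding budget up front (or keep stepping on padded input until all ranks are done), so the per-step MoE all-to-alls stay matched. decode_step, model, and batch are hypothetical names, not tutel or fairseq API:

```python
# Hedged sketch of running the evaluation in every process with a
# shared decoding budget; decode_step/model/batch are hypothetical.
import torch
import torch.distributed as dist

def evaluate_in_lockstep(model, batch, local_max_len):
    # Agree on one budget: the max of every rank's local max_len.
    budget = torch.tensor(float(local_max_len), device='cuda')
    dist.all_reduce(budget, op=dist.ReduceOp.MAX)

    for step in range(int(budget.item())):
        # Ranks whose sentences are finished still run the forward on
        # padded input, so every rank joins every MoE all-to-all.
        model.decode_step(batch, step)
```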
Yes, I pretty much agree that running the evaluation in every process can fix this hang. In this code, we delete some BLEU arguments (i.e. max_len_a, max_len_b) to fix it, and retrain our transformer with:

```shell
CUDA_VISIBLE_DEVICES=0,1 MOE=3 L_AUX_WT=0.01 SKIP_EXPERT=1 \
  fairseq-train fairseq/data-bin/iwslt14.tokenized.de-en \
  --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr 10e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --dropout 0.3 --weight-decay 0.0001 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens 4096 \
  --eval-bleu --eval-bleu-args '{"beam": 5}' --eval-bleu-detok moses \
  --eval-bleu-remove-bpe --eval-bleu-print-samples \
  --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
  --ddp-backend legacy_ddp --max-update 100000
```

The code seems to work well, and deleting these arguments appears to affect the BLEU score only slightly, but I am not completely sure about that.
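For reference, the decoding change boils down to dropping the length terms from `--eval-bleu-args`: a run configured like `'{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}'` derives a per-GPU max_len from the local source lengths, while `'{"beam": 5}'` (as above) leaves every GPU with the same length-independent default budget. The combined "before" value is an assumption stitched together from the first comment; only the individual arguments appear in this thread.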