
Comments (7)

Fragile-azalea commented on May 22, 2024

{"max_len_a": 1.2, "max_len_b": 10} means that max_len differs across GPUs, because each GPU derives it from its own source length (please see https://github.com/facebookresearch/fairseq/blob/main/fairseq/sequence_generator.py#L335-L576).
So the all-to-all on the GPU with the largest max_len is left waiting for the other processes, which never join it.
One workaround is to use {"max_len": 20} instead.
However, I don't fully understand how this change affects the BLEU score.
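To make the length dependence concrete, here is a minimal sketch of how a per-batch step cap of the form min(int(a * src_len + b), max_len - 1) behaves, which is how I read the linked fairseq code. The function name, the default values, and the source lengths are illustrative assumptions, not fairseq's actual API or defaults.

```python
def decode_step_budget(src_len, max_len_a=0.0, max_len_b=200, hard_max_len=200):
    """Cap on decoding steps for one batch (hypothetical helper).

    Mirrors the shape min(int(a * src_len + b), hard_max_len - 1).
    All default values here are assumptions chosen for illustration.
    """
    return min(int(max_len_a * src_len + max_len_b), hard_max_len - 1)


# Hypothetical longest source lengths seen by two GPUs in the same validation step.
src_len_per_gpu = {"gpu0": 3, "gpu1": 7}

# With max_len_a=1.2 and max_len_b=10, the cap depends on each GPU's own batch.
length_dependent = {
    gpu: decode_step_budget(n, max_len_a=1.2, max_len_b=10)
    for gpu, n in src_len_per_gpu.items()
}

# With only a fixed cap of 20, every GPU gets the same budget.
fixed_cap = {
    gpu: decode_step_budget(n, hard_max_len=20)
    for gpu, n in src_len_per_gpu.items()
}

print(length_dependent)  # the two budgets differ
print(fixed_cap)         # the two budgets are identical
```

A cap that ignores the source length gives every GPU the same upper bound on decoding steps, which is consistent with the workaround above.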

from tutel.

ghostplant commented on May 22, 2024

Thanks for your information. This is a dup of #173.

We'll update the fairseq patch to add inequivalent_tokens=True, which was recently added to tutel but is not yet in the fairseq patch. In the meantime, you can apply it yourself.


Fragile-azalea commented on May 22, 2024

If I understand correctly, I respectfully disagree with this view. In #173, different devices receive inequivalent numbers of tokens. In this case, however, some devices perform no forward pass at all, because their max_len differs. (https://github.com/facebookresearch/fairseq/blob/main/fairseq/sequence_generator.py#L335)


ghostplant commented on May 22, 2024

That's interesting. If it's true that one GPU performs 5 forwards and another GPU performs 6 forwards, does traditional data parallel even work? I think the application itself has to do something special to avoid that, right?


Fragile-azalea commented on May 22, 2024

The number of forward passes may differ across GPUs during validation. In some codebases, validation forwards even run only on GPU 0.

In this code, two sentences with different lengths (e.g., "How are you" and "Thank you") are distributed to two GPUs. The BLEU score seems to be calculated word by word. As a result, "How" and "Thank" are processed together as one data-parallel step, and so are "are" and "you". However, the other GPU has no word left to pair with the final "you" in "How are you".
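The mismatch described here can be sketched as a small simulation. It assumes one collective call per decoding step on every rank and counts steps per word, which simplifies the real decoder; all names are hypothetical.

```python
def participants_at(step, steps_per_rank):
    """Ranks that still run a forward pass (and its collective) at this step."""
    return [rank for rank, n in enumerate(steps_per_rank) if n > step]


def first_unmatched_step(steps_per_rank):
    """First step where only some ranks call the collective, or None if all match."""
    if len(set(steps_per_rank)) <= 1:
        return None
    # The rank with the fewest steps stops here; the others keep calling.
    return min(steps_per_rank)


# The example from this comment: one sentence per rank, one step per word.
sentences = ["How are you".split(), "Thank you".split()]
steps_per_rank = [len(words) for words in sentences]

hang_step = first_unmatched_step(steps_per_rank)
waiting_ranks = participants_at(hang_step, steps_per_rank)

print(f"step {hang_step}: only ranks {waiting_ranks} enter the collective")
```

At that step, rank 0 enters a collective that rank 1 never calls, so rank 0 blocks.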


ghostplant commented on May 22, 2024

OK, this root cause makes sense. The MoE layer within one process does not know the application's intent, i.e. whether the other processes are going to forward the MoE layer together with it. Even if it knew that some processes were suspended and would not join the validation procedure, the expert parameters stored in those processes would be inaccessible to the validation process.

So it seems there is no solution other than changing the application side, namely running evaluation in every process. Can you point to the lines of code that perform the validation procedure?
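One way an application could keep every process participating is to agree on a common step count and let ranks that finish early run dummy forwards. The sketch below only simulates that schedule in plain Python; it is not a tutel or fairseq patch, and the names are hypothetical. In a real job, the maximum would typically come from an all-reduce with a MAX reduction (for example torch.distributed.all_reduce with ReduceOp.MAX) rather than a local max().

```python
def padded_schedule(steps_per_rank):
    """Pad each rank's step list so that all ranks run the same number of steps.

    max() stands in for an all-reduce(MAX) across ranks (assumption).
    "dummy" marks a forward pass whose output is discarded; it exists only so
    the rank still joins the collective at that step.
    """
    total = max(steps_per_rank)
    return [["real"] * n + ["dummy"] * (total - n) for n in steps_per_rank]


# Rank 0 decodes 3 steps, rank 1 decodes 2 steps.
schedule = padded_schedule([3, 2])

for rank, steps in enumerate(schedule):
    print(rank, steps)
```

Every rank now enters the same number of collectives, at the cost of some wasted computation on the ranks that finish early.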


Fragile-azalea commented on May 22, 2024

Yes, I pretty much agree that running evaluation in every process can fix this hang. In this code, we deleted some BLEU args (i.e. max_len_a, max_len_b) to fix it, and retrained our transformer with:

CUDA_VISIBLE_DEVICES=0,1 MOE=3 L_AUX_WT=0.01 SKIP_EXPERT=1 fairseq-train fairseq/data-bin/iwslt14.tokenized.de-en --arch transformer_iwslt_de_en --share-decoder-input-output-embed --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr 10e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 4096 --eval-bleu --eval-bleu-args '{"beam": 5}' --eval-bleu-detok moses --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --ddp-backend legacy_ddp --max-update 100000

The code seems to work well, and the BLEU scores appear to be only slightly affected by deleting these arguments, but I'm not completely sure about that.

