Comments (7)
{"max_len_a": 1.2, "max_len_b": 10} means that max_len will differ in different GPUs(please see https://github.com/facebookresearch/fairseq/blob/main/fairseq/sequence_generator.py#L335-L576).
so the alltoall of the largest length is always pending other processes.
One solution is that using {"max_len": 20} instead.
But I don't really understand the effect of this change on the BLEU score.
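For concreteness, a minimal sketch of that budget computation (following the linked sequence_generator.py lines; the function name and the sample source lengths are illustrative):

```python
# Minimal sketch of fairseq's per-batch decoding budget
# (see the linked sequence_generator.py lines); the function
# name and the sample source lengths are illustrative.
def decode_budget(src_len, max_len_a=1.2, max_len_b=10):
    # The budget depends on the local batch's source length,
    # which differs across data-parallel GPUs.
    return int(max_len_a * src_len + max_len_b)

print(decode_budget(20))  # GPU 0: up to 34 decoding steps
print(decode_budget(10))  # GPU 1: up to 22 decoding steps
# Each decoding step runs the MoE layer once (one all-to-all),
# so GPU 0's later all-to-alls wait on calls GPU 1 never makes.
```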
Thanks for your information. This is a duplicate of #173.
We'll update the fairseq patch to add inequivalent_tokens=True, which was recently added to tutel but is not yet in the fairseq patch. You may apply it yourself temporarily.
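Until the patch is updated, a hedged sketch of what applying it yourself might look like at the MoE call site. The layer construction here follows tutel's README style but is only illustrative, and it assumes a recent tutel where the layer's forward accepts this keyword:

```python
# Hedged sketch: enabling inequivalent_tokens at the MoE call site.
# Assumes a distributed context and a recent tutel whose forward
# accepts this keyword; construction arguments are illustrative.
import torch.nn.functional as F
from tutel import moe as tutel_moe

moe = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=512,
    experts={'type': 'ffn', 'count_per_node': 2,
             'hidden_size_per_expert': 1024,
             'activation_fn': lambda x: F.relu(x)},
)

def forward_moe(x):
    # Tell tutel that ranks may contribute different token counts,
    # so the dispatch/all-to-all is sized consistently across ranks.
    return moe(x, inequivalent_tokens=True)
```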
If I understand correctly, I respectfully disagree with this view. In #173, different devices hold inequivalent numbers of tokens; in this case, however, some devices run no forward pass at all because max_len differs between them (https://github.com/facebookresearch/fairseq/blob/main/fairseq/sequence_generator.py#L335).
That's interesting. If it's true that one GPU performs 5 forwards while another performs 6, does traditional data parallelism even work? I think the application itself has to do something special to avoid that, right?
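A hypothetical minimal reproduction of the failure mode under discussion: if ranks post different numbers of collective calls, the extra call blocks forever. The setup below is illustrative and not from the thread:

```python
# Hypothetical minimal reproduction: ranks posting different numbers
# of all-to-alls deadlock on the unmatched call. Run with e.g.:
#   torchrun --nproc_per_node=2 repro.py
import torch
import torch.distributed as dist

dist.init_process_group('nccl')
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)

steps = 6 if rank == 0 else 5   # "5 forwards vs. 6 forwards"
buf = torch.zeros(world, device='cuda')
for _ in range(steps):
    out = torch.empty_like(buf)
    dist.all_to_all_single(out, buf)  # rank 0's 6th call never returns
```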
The number of forward passes may differ across GPUs during validation; in some codebases, validation forwards even run only on GPU 0.
In this code, two sentences of different lengths (e.g. "How are you" and "Thank you") are distributed to two GPUs, and the BLEU validation decodes the translations word by word. So at the first step "How" and "Thank" are processed in data parallel, and at the second step "are" and "you" are as well; but at the third step there is no word left on the other GPU to pair with the "you" in "How are you", so that step's MoE all-to-all has no partner.
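A toy walk-through of that pairing (hypothetical, just printing which rank joins each step's all-to-all):

```python
# Toy walk-through (hypothetical) of the example above: one MoE
# all-to-all per generated word, two ranks decoding in lockstep.
outputs = {0: ["How", "are", "you"], 1: ["Thank", "you"]}
for step in range(max(len(words) for words in outputs.values())):
    for rank, words in outputs.items():
        if step < len(words):
            print(f"step {step}: rank {rank} generates {words[step]!r} and joins the all-to-all")
        else:
            print(f"step {step}: rank {rank} is done and posts nothing -> the other rank hangs")
```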
OK, this root cause makes sense. The MoE layer within one process does not know the application's intent, i.e. whether the other processes are going to run the MoE forward together with it or not. Even if it knew that some processes were suspended and would not join the validation procedure, the expert parameters stored on those processes would still be inaccessible to the validating process.
So it seems there is no solution except making changes on the application side, which is to run the evaluation in every process. Can you point to the code lines that perform the related validation procedure?
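A hedged sketch of that application-side fix: make every rank agree on a shared decoding budget up front (or keep stepping on padded input until all ranks are done), so the per-step MoE all-to-alls stay matched. decode_step, model, and batch are hypothetical names, not tutel or fairseq API:

```python
# Hedged sketch of running the evaluation in every process with a
# shared decoding budget; decode_step/model/batch are hypothetical.
import torch
import torch.distributed as dist

def evaluate_in_lockstep(model, batch, local_max_len):
    # Agree on one budget: the max of every rank's local max_len.
    budget = torch.tensor(float(local_max_len), device='cuda')
    dist.all_reduce(budget, op=dist.ReduceOp.MAX)

    for step in range(int(budget.item())):
        # Ranks whose sentences are finished still run the forward on
        # padded input, so every rank joins every MoE all-to-all.
        model.decode_step(batch, step)
```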
Yes, I pretty much agree that running the evaluation in every process can fix this hang. In this code, we delete some BLEU arguments (i.e. max_len_a, max_len_b) to fix it, and retrain our transformer with:

```shell
CUDA_VISIBLE_DEVICES=0,1 MOE=3 L_AUX_WT=0.01 SKIP_EXPERT=1 \
  fairseq-train fairseq/data-bin/iwslt14.tokenized.de-en \
  --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr 10e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --dropout 0.3 --weight-decay 0.0001 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens 4096 \
  --eval-bleu --eval-bleu-args '{"beam": 5}' --eval-bleu-detok moses \
  --eval-bleu-remove-bpe --eval-bleu-print-samples \
  --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
  --ddp-backend legacy_ddp --max-update 100000
```

The code seems to work well, and deleting these arguments appears to affect the BLEU score only slightly, but I am not completely sure about that.
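For reference, the decoding change boils down to dropping the length terms from `--eval-bleu-args`: a run configured like `'{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}'` derives a per-GPU max_len from the local source lengths, while `'{"beam": 5}'` (as above) leaves every GPU with the same length-independent default budget. The combined "before" value is an assumption stitched together from the first comment; only the individual arguments appear in this thread.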