Comments (17)
Using the command in the README, which uses less monolingual data, you should get around this:
epoch -> 7
valid_fr-en_mt_bleu -> 28.36
valid_en-fr_mt_bleu -> 30.50
test_fr-en_mt_bleu -> 34.02
test_en-fr_mt_bleu -> 36.62
This is more than what is reported in the paper; the reason, I guess, is that this data is more "in domain", as we evaluate on newstest 2014. How many GPUs did you use to run the model? The number of GPUs matters quite a lot, as it significantly increases the batch size.
from xlm.
It is possibly caused by the batch size. I use 4 GPUs, and each GPU only uses 1000 tokens (limited to 12 GB of memory). I will increase the batch size and try again. Thank you for your response.
from xlm.
@StillKeepTry Maybe you could use gradient accumulation to solve the OOM problems caused by a big batch.
from xlm.
@JiqiangLiu do you know if there is a way to accumulate gradients in distributed PyTorch? With one GPU it's easy, but I couldn't find an easy way to do it with more than one GPU.
from xlm.
@glample Could you share how to accumulate gradients with PyTorch on one GPU?
from xlm.
You would have to hack a bit into the code, but the idea is simply to call the optimizer zero_grad and step only every K iterations if you want to multiply the batch size by K. Before calling the optimizer step you would also need to divide the gradients by K.
With > 1 GPU it is not as straightforward, since PyTorch distributed averages the gradients immediately when loss.backward() is called, so you cannot accumulate them for K steps.
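For reference, a minimal single-GPU sketch of this idea (model, optimizer, data_loader and K are placeholders, not from the XLM code; scaling the loss by 1/K is equivalent to dividing the gradients by K before the step):

import torch.nn.functional as F

def train_with_accumulation(model, optimizer, data_loader, K=4):
    # accumulate gradients over K mini-batches to multiply the effective batch size by K
    optimizer.zero_grad()
    for i, (x, y) in enumerate(data_loader):
        loss = F.cross_entropy(model(x), y)
        (loss / K).backward()  # scaling the loss by 1/K is the same as dividing the gradients by K
        if (i + 1) % K == 0:
            optimizer.step()       # update parameters only every K iterations
            optimizer.zero_grad()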
from xlm.
@glample Hi, today I looked at the multi-GPU gradient accumulation code implemented by fairseq. My view is as follows.
DistributedDataParallel averages over the number of workers: https://pytorch.org/docs/master/nn.html#torch.nn.parallel.DistributedDataParallel.
So we need to normalize the gradient by multiplying by the number of GPUs and dividing by the total number of words.
model_parameters = list(model.parameters())
self.multiply_grads(model_parameters, params.n_gpu_per_node / float(sum(words_sum)))
self.optimizers[name].step()
# words_sum is a list of ints; if words_sum = [100, 200], it means we update parameters every 2 batches.
# for the CLM, words_sum.append(y.size(0))
We could use the normalized gradients to update the model parameters.
If accumulating gradients on more than one GPU, we could set model.need_reduction = True on the last of the accumulated batches. When the last batch has been computed, the gradients will be all-reduced.
if params.n_gpu_per_node > 1:
    if ((freq + 1) % params.update_freq) == 0:
        model.need_reduction = True
    else:
        model.need_reduction = False
# --update-freq N : update parameters every N batches
Note: the loss of every batch should be the sum of the word-level losses.
loss = F.cross_entropy(scores, y, reduction='sum')
The CLM training code I have finished: clm.txt
I am not sure whether I understand this correctly. If there is any mistake, please give me more advice.
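To make the proposal concrete, here is a rough sketch of how these pieces could fit together in one parameter update (not the actual XLM or fairseq code; multiply_grads, update_freq, n_gpu_per_node and need_reduction are the names used above, everything else is illustrative):

import torch.nn.functional as F

def accumulated_update(model, optimizer, micro_batches, params, multiply_grads):
    # micro_batches holds params.update_freq batches that together form one parameter update
    words_sum = []
    optimizer.zero_grad()
    for freq, (x, y) in enumerate(micro_batches):
        if params.n_gpu_per_node > 1:
            # only all-reduce the gradients on the last accumulated batch
            model.need_reduction = ((freq + 1) % params.update_freq) == 0
        loss = F.cross_entropy(model(x), y, reduction='sum')  # sum of word-level losses
        loss.backward()
        words_sum.append(y.size(0))
    # undo DistributedDataParallel's averaging over workers and normalize by the total word count
    multiply_grads(list(model.parameters()), params.n_gpu_per_node / float(sum(words_sum)))
    optimizer.step()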
from xlm.
The gradients are already averaged over words in each process, so simply dividing them by update_freq before they are shared across nodes should do the trick.
What you propose seems the way to go, but what is need_reduction here? If you could prevent PyTorch distributed from doing the reduction that would work, but I'm not sure how to do that.
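For what it's worth, more recent PyTorch versions expose a no_sync() context manager on DistributedDataParallel that skips the all-reduce until the next backward outside the context. A minimal sketch of that approach, assuming ddp_model is a DistributedDataParallel wrapper and the other names are placeholders:

import contextlib

def accumulate_and_step(ddp_model, optimizer, micro_batches, loss_fn, update_freq):
    optimizer.zero_grad()
    for i, (x, y) in enumerate(micro_batches):
        last = (i + 1) % update_freq == 0
        # skip gradient synchronization on all but the last micro-batch
        ctx = contextlib.nullcontext() if last else ddp_model.no_sync()
        with ctx:
            loss = loss_fn(ddp_model(x), y) / update_freq
            loss.backward()  # the backward of the last micro-batch triggers the all-reduce
        if last:
            optimizer.step()
            optimizer.zero_grad()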
from xlm.
I don't think the PyTorch DistributedDataParallel module supports accumulating gradients. need_reduction is only supported in LegacyDistributedDataParallel: https://github.com/pytorch/fairseq/blob/master/fairseq/legacy_distributed_data_parallel.py. This should be a drop-in replacement for torch's DistributedDataParallel, and you can use the need_reduction flag to delay the all-reduce and accumulate gradients.
from xlm.
This is the DistributedDataParallel source code in my PyTorch: https://gist.github.com/JiqiangLiu/dc5ed99c32ccb920a68861bba4cd9a31#file-distributed-py-L123. It seems to support delaying the all-reduce and accumulating gradients.
from xlm.
You mean that we can use DistributedDataParallel to implement gradient accumulation in PyTorch 0.4, but for newer PyTorch versions we should use LegacyDistributedDataParallel? By the way, when does the all-reduce operation happen?
from xlm.
If possible, could you add a feature to accumulate gradients?
from xlm.
@jiahuigeng this is planned. We will try to add this feature.
from xlm.
How are your results with this accumulated gradient update? I have implemented one, but the BLEU results are not as good as expected.
from xlm.
@jiahuigeng Sorry, I have not met a similar situation before. Maybe you can output the gradient at each step and check.
from xlm.
Closing this issue due to inactivity. An implementation with gradient accumulation will be available soon.
from xlm.