
Comments (17)

glample commented on July 17, 2024

Using the command in the README, which uses less monolingual data, you should get around this:

epoch               ->     7
valid_fr-en_mt_bleu -> 28.36
valid_en-fr_mt_bleu -> 30.50
test_fr-en_mt_bleu  -> 34.02
test_en-fr_mt_bleu  -> 36.62

This is more than what is reported in the paper. The reason, I guess, is that this data is more "in domain", as we evaluate on newstest 2014. How many GPUs did you use to run the model? The number of GPUs matters quite a lot, as it significantly increases the batch size.

StillKeepTry commented on July 17, 2024

It is possibly caused by the batch size. I use 4 GPUs, and each GPU only uses 1000 tokens (limited by 12GB of memory). I will try again with a larger batch size. Thank you for your response.

liujiqiang999 commented on July 17, 2024

@StillKeepTry Maybe you could use gradient accumulation to get around the OOM problems caused by a big batch.

glample commented on July 17, 2024

@JiqiangLiu do you know if there is a way to accumulate gradients in distributed PyTorch? With one GPU it's easy, but I couldn't find an easy way to do it with more than one GPU.

Julisa-test commented on July 17, 2024

@glample Could you share how to accumulate gradients in PyTorch with one GPU?

glample commented on July 17, 2024

You would have to hack into the code a bit, but the idea is simply to call the optimizer's zero_grad and step only every K iterations if you want to multiply the batch size by K. Before calling the optimizer step, you would also need to divide the gradients by K.

With more than 1 GPU it is not as straightforward, since PyTorch distributed averages the gradients immediately when loss.backward() is called, so you cannot accumulate them for K steps.
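
A minimal single-GPU sketch of that recipe (the tiny linear model and the dummy batch_iterator below are just placeholders for the real XLM model and data loader):

import torch
import torch.nn as nn
import torch.nn.functional as F

K = 8  # multiply the effective batch size by K

def batch_iterator(n_batches=32, batch_size=16):
    # dummy data in place of the real data loader
    for _ in range(n_batches):
        yield torch.randn(batch_size, 512), torch.randint(0, 512, (batch_size,))

model = nn.Linear(512, 512)  # stand-in for the real transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

optimizer.zero_grad()
for i, (x, y) in enumerate(batch_iterator()):
    loss = F.cross_entropy(model(x), y)
    loss.backward()  # gradients accumulate in .grad across iterations
    if (i + 1) % K == 0:
        for p in model.parameters():
            if p.grad is not None:
                p.grad.div_(K)  # divide the accumulated gradients by K
        optimizer.step()
        optimizer.zero_grad()

Dividing the loss by K before backward() is equivalent and avoids the extra loop over the parameters.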

liujiqiang999 commented on July 17, 2024

@glample Hi, today I looked at the multi-GPU gradient accumulation code implemented in fairseq. My understanding is as follows.
DistributedDataParallel averages over the number of workers: https://pytorch.org/docs/master/nn.html#torch.nn.parallel.DistributedDataParallel.
So we need to normalize the gradients by multiplying by the number of GPUs and dividing by the total number of words.

model_parameters = list(model.parameters())
self.multiply_grads(model_parameters, params.n_gpu_per_node / float(sum(words_sum)))
self.optimizers[name].step()

# words_sum is a list of ints: if words_sum = [100, 200], we update the parameters every 2 batches.
# For the CLM, words_sum.append(y.size(0))

We can then use the normalized gradients to update the model parameters.
If we accumulate gradients on more than one GPU, we can set model.need_reduction = True on the last of the accumulated batches; the reduction then only happens once that last batch has been processed.

if params.n_gpu_per_node > 1:
    if ((freq + 1) % params.update_freq) == 0:
        model.need_reduction = True
    else:
        model.need_reduction = False

# --update-freq N : update parameters every N_i batches

Note: the loss of every batch should be the sum of the word-level losses.

loss = F.cross_entropy(scores, y, reduction='sum')

CLM training code I have finished: clm.txt

I am not sure whether I understand it correctly. If there is any mistake, please give me more advice.
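
Putting the snippets above together, the training loop would look roughly like this; model, optimizer, params and batches are the same placeholders as in the snippets, and need_reduction is assumed to be honored by the distributed wrapper, so treat it as a sketch rather than working XLM code.

import torch.nn.functional as F

def multiply_grads(parameters, c):
    # scale all accumulated gradients in place by the constant c
    for p in parameters:
        if p.grad is not None:
            p.grad.data.mul_(c)

words_sum = []
for freq, (x, y) in enumerate(batches):  # `batches` yields the mini-batches of one epoch
    if params.n_gpu_per_node > 1:
        # only all-reduce on the last of the accumulated batches
        model.need_reduction = ((freq + 1) % params.update_freq) == 0

    scores = model(x)
    loss = F.cross_entropy(scores, y, reduction='sum')  # sum of word-level losses
    loss.backward()
    words_sum.append(y.size(0))

    if (freq + 1) % params.update_freq == 0:
        # the all-reduce averaged the gradients over workers, so multiply by the
        # number of GPUs and divide by the total number of target words
        multiply_grads(model.parameters(), params.n_gpu_per_node / float(sum(words_sum)))
        optimizer.step()
        optimizer.zero_grad()
        words_sum = []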

glample commented on July 17, 2024

The gradients are already averaged over words in each process, so simply dividing them by update_freq before they are shared across nodes should do the trick.

What you propose seems like the way to go, but what is need_reduction here? If you could prevent PyTorch distributed from doing the reduction, that would work, but I'm not sure how to do that.

myleott commented on July 17, 2024

I don't think the PyTorch DistributedDataParallel module supports accumulating gradients. need_reduction is only supported in LegacyDistributedDataParallel: https://github.com/pytorch/fairseq/blob/master/fairseq/legacy_distributed_data_parallel.py. This should be a drop-in replacement for torch's DistributedDataParallel, and you can use the need_reduction flag to delay the all-reduce and accumulate gradients.
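
For what it's worth, recent PyTorch releases also expose a no_sync() context manager on DistributedDataParallel that delays the all-reduce in the same spirit as need_reduction. A rough sketch, assuming a torchrun-style launch and with the model, data loader and hyperparameters as placeholders:

import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend="nccl")  # assumes RANK/WORLD_SIZE/LOCAL_RANK are set by the launcher
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(512, 512).cuda()  # stand-in for the real model
ddp_model = DistributedDataParallel(model, device_ids=[local_rank])
optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-4)
update_freq = 8  # accumulate 8 mini-batches per update

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):  # `loader` is a placeholder data iterator
    x, y = x.cuda(), y.cuda()
    if (i + 1) % update_freq != 0:
        with ddp_model.no_sync():  # skip the all-reduce, just accumulate locally
            (F.cross_entropy(ddp_model(x), y) / update_freq).backward()
    else:
        (F.cross_entropy(ddp_model(x), y) / update_freq).backward()  # this backward triggers the all-reduce
        optimizer.step()
        optimizer.zero_grad()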

liujiqiang999 commented on July 17, 2024

This is the DistributedDataParallel source code in my PyTorch installation: https://gist.github.com/JiqiangLiu/dc5ed99c32ccb920a68861bba4cd9a31#file-distributed-py-L123. It seems to support delaying the all-reduce and accumulating gradients.

myleott commented on July 17, 2024

liujiqiang999 commented on July 17, 2024

You mean that we can use DistributedDataParallel to implement gradient accumulation in PyTorch 0.4, but for newer versions of PyTorch we should use LegacyDistributedDataParallel? By the way, when does the all-reduce operation happen?

jiahuigeng commented on July 17, 2024

> The gradients are already averaged over words in each process, so simply dividing them by update_freq before they are shared across nodes should do the trick.
>
> What you propose seems like the way to go, but what is need_reduction here? If you could prevent PyTorch distributed from doing the reduction, that would work, but I'm not sure how to do that.

If possible, could you add a feature for accumulating gradients?

glample commented on July 17, 2024

@jiahuigeng this is planned. We will try to add this feature.

jiahuigeng commented on July 17, 2024

> This is the DistributedDataParallel source code in my PyTorch installation: https://gist.github.com/JiqiangLiu/dc5ed99c32ccb920a68861bba4cd9a31#file-distributed-py-L123. It seems to support delaying the all-reduce and accumulating gradients.

How are your results with this accumulated-gradient update? I have implemented one, but the BLEU scores are not as good as expected.

liujiqiang999 commented on July 17, 2024

@jiahuigeng Sorry, I have not run into a similar situation before. Maybe you can print the gradients at each step and check them.
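
One way to check, as a toy sketch (a hypothetical linear model standing in for the real one): the gradients from K accumulated mini-batches, divided by K, should match the gradients of a single batch that is K times larger, up to floating-point noise.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(32, 8)
x = torch.randn(64, 32)
y = torch.randint(0, 8, (64,))
K = 4

# one big batch
model.zero_grad()
F.cross_entropy(model(x), y).backward()
big = [p.grad.clone() for p in model.parameters()]

# K accumulated mini-batches, then divide by K
model.zero_grad()
for xk, yk in zip(x.chunk(K), y.chunk(K)):
    F.cross_entropy(model(xk), yk).backward()
acc = [p.grad / K for p in model.parameters()]

for g_big, g_acc in zip(big, acc):
    print(torch.allclose(g_big, g_acc, atol=1e-6))  # expect True for both weight and bias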

glample commented on July 17, 2024

Closing this issue due to inactivity. An implementation with gradient accumulation will be available soon.
