Comments (17)
Using the command in the README, which uses less monolingual data, you should get around this:
epoch -> 7
valid_fr-en_mt_bleu -> 28.36
valid_en-fr_mt_bleu -> 30.50
test_fr-en_mt_bleu -> 34.02
test_en-fr_mt_bleu -> 36.62
This is more than what is reported in the paper; the reason, I guess, is that this data is more "in domain", as we evaluate on newstest 2014. How many GPUs did you use to run the model? The number of GPUs matters quite a lot, as it significantly increases the batch size.
from xlm.
It is possibly caused by the batch size. I use 4 GPUs, and each GPU only uses 1000 tokens (limited to 12 GB of memory). I will increase the batch size and try again. Thank you for your response.
from xlm.
@StillKeepTry Maybe you could use gradient accumulation to solve the OOM problems caused by a big batch.
from xlm.
@JiqiangLiu do you know if there is a way to accumulate gradients in distributed PyTorch? With one GPU it's easy, but I couldn't find an easy way to do it with more than one GPU.
from xlm.
@glample Could you share how to accumulate gradients with PyTorch on one GPU?
from xlm.
You would have to hack a bit into the code, but the idea is simply to call the optimizer zero_grad and step only every K iterations if you want to multiply the batch size by K. Before calling the optimizer step you would also need to divide the gradients by K.
With > 1 GPU it is not as straightforward, since PyTorch distributed averages the gradients immediately when loss.backward() is called, so you cannot accumulate them for K steps.
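For reference, a minimal single-GPU sketch of this idea (model, optimizer, data_loader and K are placeholders, not from the XLM code; scaling the loss by 1/K is equivalent to dividing the gradients by K before the step):

import torch.nn.functional as F

def train_with_accumulation(model, optimizer, data_loader, K=4):
    # accumulate gradients over K mini-batches to multiply the effective batch size by K
    optimizer.zero_grad()
    for i, (x, y) in enumerate(data_loader):
        loss = F.cross_entropy(model(x), y)
        (loss / K).backward()  # scaling the loss by 1/K is the same as dividing the gradients by K
        if (i + 1) % K == 0:
            optimizer.step()       # update parameters only every K iterations
            optimizer.zero_grad()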
from xlm.
@glample Hi, today I looked at the multi-GPU gradient accumulation code implemented by fairseq. My view is as follows.
DistributedDataParallel averages over the number of workers: https://pytorch.org/docs/master/nn.html#torch.nn.parallel.DistributedDataParallel.
So we need to normalize the gradient by multiplying by the number of GPUs and dividing by the total number of words.
model_parameters = list(model.parameters())
self.multiply_grads(model_parameters, params.n_gpu_per_node / float(sum(words_sum)))
self.optimizers[name].step()
# words_sum is a list of ints; if words_sum = [100, 200], it means we update parameters every 2 batches.
# for the CLM, words_sum.append(y.size(0))
We could use the normalized gradients to update the model parameters.
If accumulating gradients on more than one GPU, we could set model.need_reduction = True on the last of the accumulated batches. When the last batch has been computed, the gradients will be all-reduced.
if params.n_gpu_per_node > 1:
    if ((freq + 1) % params.update_freq) == 0:
        model.need_reduction = True
    else:
        model.need_reduction = False
# --update-freq N : update parameters every N batches
Note: the loss of every batch should be the sum of the word-level losses.
loss = F.cross_entropy(scores, y, reduction='sum')
The CLM training code I have finished: clm.txt
I am not sure whether I understand this correctly. If there is any mistake, please give me more advice.
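To make the proposal concrete, here is a rough sketch of how these pieces could fit together in one parameter update (not the actual XLM or fairseq code; multiply_grads, update_freq, n_gpu_per_node and need_reduction are the names used above, everything else is illustrative):

import torch.nn.functional as F

def accumulated_update(model, optimizer, micro_batches, params, multiply_grads):
    # micro_batches holds params.update_freq batches that together form one parameter update
    words_sum = []
    optimizer.zero_grad()
    for freq, (x, y) in enumerate(micro_batches):
        if params.n_gpu_per_node > 1:
            # only all-reduce the gradients on the last accumulated batch
            model.need_reduction = ((freq + 1) % params.update_freq) == 0
        loss = F.cross_entropy(model(x), y, reduction='sum')  # sum of word-level losses
        loss.backward()
        words_sum.append(y.size(0))
    # undo DistributedDataParallel's averaging over workers and normalize by the total word count
    multiply_grads(list(model.parameters()), params.n_gpu_per_node / float(sum(words_sum)))
    optimizer.step()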
from xlm.
The gradients are already averaged over words in each process, so simply dividing them by update_freq before they are shared across nodes should do the trick.
What you propose seems the way to go, but what is need_reduction here? If you could prevent PyTorch distributed from doing the reduction that would work, but I'm not sure how to do that.
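For what it's worth, more recent PyTorch versions expose a no_sync() context manager on DistributedDataParallel that skips the all-reduce until the next backward outside the context. A minimal sketch of that approach, assuming ddp_model is a DistributedDataParallel wrapper and the other names are placeholders:

import contextlib

def accumulate_and_step(ddp_model, optimizer, micro_batches, loss_fn, update_freq):
    optimizer.zero_grad()
    for i, (x, y) in enumerate(micro_batches):
        last = (i + 1) % update_freq == 0
        # skip gradient synchronization on all but the last micro-batch
        ctx = contextlib.nullcontext() if last else ddp_model.no_sync()
        with ctx:
            loss = loss_fn(ddp_model(x), y) / update_freq
            loss.backward()  # the backward of the last micro-batch triggers the all-reduce
        if last:
            optimizer.step()
            optimizer.zero_grad()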
from xlm.
I don't think the PyTorch DistributedDataParallel module supports accumulating gradients. need_reduction is only supported in LegacyDistributedDataParallel: https://github.com/pytorch/fairseq/blob/master/fairseq/legacy_distributed_data_parallel.py. This should be a drop-in replacement for torch's DistributedDataParallel, and you can use the need_reduction flag to delay the all-reduce and accumulate gradients.
from xlm.
This is the DistributedDataParallel source code in my PyTorch: https://gist.github.com/JiqiangLiu/dc5ed99c32ccb920a68861bba4cd9a31#file-distributed-py-L123. It seems to support delaying the all-reduce and accumulating gradients.
from xlm.
You mean that we can use DistributedDataParallel to implement gradient accumulation in PyTorch 0.4, but for newer PyTorch versions we should use LegacyDistributedDataParallel? By the way, when does the all-reduce operation happen?
from xlm.
If possible, could you add a feature to accumulate gradients?
from xlm.
@jiahuigeng this is planned. We will try to add this feature.
from xlm.
How are your results with this accumulated gradient update? I have implemented one, but the BLEU results are not as good as expected.
from xlm.
@jiahuigeng Sorry, I have not met a similar situation before. Maybe you can output the gradient at each step and check.
from xlm.
Closing this issue due to inactivity. An implementation with gradient accumulation will be available soon.
from xlm.