
Comments (9)

hugemicrobe commented on May 19, 2024

I wrote a piece of code for testing, shown below. print_grad is run on 2 processes, and each process adds i+1 to the data of the network parameter. Since data is shared, the result is 2+1+2=5 for both processes. The gradient, however, behaves differently: each process has its own gradient initialized to 0, so the results 0+1=1 and 0+2=2 differ between the two processes.

If I understand correctly, the gradient is allocated separately for each process, as mentioned in the following post.

pytorch/examples#138

I think the point is that since grad is still None when we call share_memory(), each process ends up allocating its own gradient separately. One can instead set grad to 0 before calling share_memory(); in that case the gradient will be shared (see the sketch after the output below).

from __future__ import print_function
import os
import torch.multiprocessing as mp
import torch
from torch import nn
from torch.autograd import Variable

os.environ['OMP_NUM_THREADS'] = '1'


def print_grad(shared_model, i):
    # Each worker adds i+1 to the gradient and to the data of every parameter.
    for p in shared_model.parameters():
        if p._grad is None:
            # grad is still None at this point, so each process allocates its own.
            p._grad = Variable(torch.FloatTensor([0]))
        p._grad += i + 1
        p.data += i + 1
        print(p.data)
        print(p.grad)


class TestNet(nn.Module):
    def __init__(self):
        super(TestNet, self).__init__()
        self.x = nn.Parameter(torch.Tensor([2]))

    def forward(self):
        return self.x


model = TestNet()
model.share_memory()

processes = [mp.Process(target=print_grad, args=(model, i)) for i in range(0, 2)]
[p.start() for p in processes]
[p.join() for p in processes]

5
[torch.FloatTensor of size 1]

Variable containing:
1
[torch.FloatTensor of size 1]

5
[torch.FloatTensor of size 1]

Variable containing:
2
[torch.FloatTensor of size 1]
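
As a follow-up to the suggestion above, here is a minimal sketch (reusing the same TestNet and print_grad, old-style Variable API) of allocating the gradients before share_memory(), so that the gradient tensors are moved into shared memory along with the weights:

model = TestNet()
for p in model.parameters():
    # Allocate grad up front; share_memory() should then share it as well.
    p._grad = Variable(torch.zeros(p.data.size()))
model.share_memory()

processes = [mp.Process(target=print_grad, args=(model, i)) for i in range(0, 2)]
[p.start() for p in processes]
[p.join() for p in processes]
# With a shared gradient, both workers accumulate into the same tensor, so after
# both finish the gradient holds 0+1+2=3 (intermediate prints depend on timing).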


xuehy commented on May 19, 2024

I guess that if shared_param.grad is not None, then some other thread must be updating the network, and the current thread should not update it until the others complete. But I have a question: as I understand it, if the grad is not None, the code just returns, which means the gradient of the current thread is simply discarded. Is this really the case? Then only one of the threads can update the network, and the others that finish at the same time just run in vain?
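
For reference, the logic being discussed is roughly the following, paraphrased from pytorch-a3c (the exact code in the repo may differ slightly):

def ensure_shared_grads(model, shared_model):
    # Bind the shared model's gradients to this worker's local gradients.
    for param, shared_param in zip(model.parameters(),
                                   shared_model.parameters()):
        if shared_param.grad is not None:
            # Already bound (as seen by this process); nothing more to do.
            return
        shared_param._grad = param.grad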


boscotsang commented on May 19, 2024

@xuehy It seems that shared_param._grad = param.grad makes shared_param.grad reference the same tensor as param.grad. Therefore, once shared_param._grad is not None, it always has the same values as param.grad.


xuehy commented on May 19, 2024

@boscotsang I am still confused.

Once the shared_param._grad is not None, it always has the same values as param.grad

But there are many threads, each owning a different param.grad. Assume there are two threads A and B. What if shared_param._grad is assigned A's param.grad? Then for thread B, is shared_param.grad always not None?


hugemicrobe commented on May 19, 2024

@xuehy It seems that grad (or _grad) is not shared among processes by global_network.share_memory(); only the weights are shared. Therefore, each process has its own shared_param.grad.
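
One way to check this directly is to inspect the underlying storages; a minimal sketch, assuming Tensor.is_shared() is available in your PyTorch version and reusing the TestNet defined above:

def report_sharing(model):
    # Report whether each parameter's data (and grad, if allocated) lives in shared memory.
    for p in model.parameters():
        grad_shared = p.grad is not None and p.grad.data.is_shared()
        print('data shared:', p.data.is_shared(), '| grad shared:', grad_shared)

model = TestNet()
model.share_memory()
report_sharing(model)  # expected: data shared: True, grad shared: False (grad is still None)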


xuehy commented on May 19, 2024

@hugemicrobe
The documentation says the following:
[screenshot of the documentation]
Does it mean that shared_param.grad is also shared?


SYTMTHU commented on May 19, 2024

@xuehy
I think the fact that shared_param.grad is shared is exactly why this function works; otherwise shared_param.grad would always be None. So it seems to me that when a process detects that some other process has already copied its local grad to shared_param.grad, it chooses to give up its own update, since it directly returns.

What do you think?


ikostrikov commented on May 19, 2024

If you are not comfortable with A3C, I've just made my A2C code public: https://github.com/ikostrikov/pytorch-a2c.


xuehy commented on May 19, 2024

@SYTMTHU Yes, I understand how it works. But I think that this way the processes waste a lot of time doing nothing. Over the same period of time, is the number of parameter updates with A3C actually the same as with a non-distributed version? Can I then conclude that the only difference is that the updates in A3C come from different environments, while the updates of a non-distributed algorithm come from a single environment?

