Independently computing SAM gradient on multiple accelerators (sam, 8 comments, closed)

evanatyourservice commented on August 16, 2024

Independently computing SAM gradient on multiple accelerators

Comments (8)

davda54 commented on August 16, 2024

You can explicitly stop averaging across GPUs by calling:

with model.no_sync():
    loss.backward()

I hope that helps in your case :)

guozhiyao commented on August 16, 2024

> You can explicitly stop averaging across GPUs by calling:
> with model.no_sync():
>     loss.backward()
> I hope that helps in your case :)

Thanks a lot. What I meant is that DDP already averages the gradients automatically, so we don't need to call reduce_gradients_from_all_accelerators(). Is that right?

evanatyourservice commented on August 16, 2024

I just did some testing and this seems to work and be the same way they do it in the paper.

davda54 commented on August 16, 2024

Hi, thank you very much for this great suggestion!

I have not tried distributed training with SAM, but your code seems to be in line with the paper. Does it produce better results than reducing the gradients before both steps? I will add a comment about the independent computation to the readme, or, if you want, you can make a pull request (as you know more about this important detail than I do) :)
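
To make the two options in this question concrete, here is a toy, framework-free sketch (not this repo's code). It assumes each accelerator holds a quadratic loss 0.5 * a * ||w - b||^2 on its own batch; `worker_grad`, `sam_grad_synced`, and `sam_grad_independent` are hypothetical names used only for illustration.

```python
import math

def worker_grad(w, a, b):
    """Gradient of the toy per-worker loss 0.5 * a * ||w - b||^2."""
    return [a * (wi - bi) for wi, bi in zip(w, b)]

def mean(vectors):
    """Element-wise mean of equally sized vectors (stands in for an all-reduce)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def perturb(w, g, rho):
    """SAM ascent step: move w by rho along the normalized gradient g."""
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    return [wi + rho * gi / norm for wi, gi in zip(w, g)]

def sam_grad_synced(w, workers, rho):
    """Average the gradients first, so every worker uses the same perturbation."""
    g = mean([worker_grad(w, a, b) for a, b in workers])
    w_adv = perturb(w, g, rho)
    return mean([worker_grad(w_adv, a, b) for a, b in workers])

def sam_grad_independent(w, workers, rho):
    """Each worker perturbs with its own local gradient; only the final
    gradients are averaged."""
    grads = []
    for a, b in workers:
        w_adv = perturb(w, worker_grad(w, a, b), rho)
        grads.append(worker_grad(w_adv, a, b))
    return mean(grads)

# Two toy "accelerators" (assumed values) whose local batches pull w in
# opposite directions.
workers = [(1.0, [-1.0]), (1.0, [1.0])]
w = [0.5]
g_synced = sam_grad_synced(w, workers, rho=0.1)
g_independent = sam_grad_independent(w, workers, rho=0.1)
```

In this toy setup the two variants return different update directions (roughly 0.6 versus 0.5), so the choice is not just an implementation detail. Whether one trains better in practice is exactly the open question here.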

evanatyourservice commented on August 16, 2024

So I haven't done an exact side-by-side study yet, but it seems to improve things. I looked closely at the authors' JAX code and yours, and I'm sure it's exactly the same as the way they do it, so that's good. I'm pretty busy, so I probably won't get around to a pull request, but yeah, you could add this to the readme; it might be useful. Cheers! I'll get around to doing an exact side-by-side test and report back.

guozhiyao commented on August 16, 2024

I use DDP to parallelize the model training on multiple machines. It seems that DDP will automatically average the gradients on different GPUs during backpropagation.

evanatyourservice commented on August 16, 2024

I was using TPUs when I wrote that example

yzlnew commented on August 16, 2024

> You can explicitly stop averaging across GPUs by calling:
> with model.no_sync():
>     loss.backward()
> I hope that helps in your case :)

You should wrap the whole loss calculation, including the forward pass, in the context. Otherwise it won't work, afaik.

with model.no_sync():
    loss = criterion(model(inputs), targets)
    loss.backward()
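
Putting that into a full SAM step might look like the sketch below. It assumes `model` is wrapped in `torch.nn.parallel.DistributedDataParallel` and that the SAM optimizer exposes `first_step(zero_grad=True)` and `second_step(zero_grad=True)` methods (an assumption here, so check it against the readme). `sam_step_ddp` is a hypothetical helper name, and this has not been benchmarked.

```python
def sam_step_ddp(model, optimizer, criterion, inputs, targets):
    """One SAM update where the first (ascent) gradient stays local to each GPU."""
    # First pass: forward and backward are both inside no_sync(), so DDP
    # skips the all-reduce and each GPU keeps its own gradient.
    with model.no_sync():
        loss = criterion(model(inputs), targets)
        loss.backward()
    # Assumed SAM API: perturb the weights using the local gradient.
    optimizer.first_step(zero_grad=True)

    # Second pass: outside no_sync(), so DDP averages gradients across GPUs.
    criterion(model(inputs), targets).backward()
    # Assumed SAM API: restore the weights and apply the base optimizer update.
    optimizer.second_step(zero_grad=True)
    return loss
```

Because the gradients are zeroed between the two passes, only the second-pass gradients should end up in the all-reduce. Every GPU then applies the same averaged update to the restored weights.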
