Hi, I'm getting this error when trying to train my own model on mult

Getting "RuntimeError: CUDA error: out of memory" when trying to train on multiple GPUs about blobgan HOT 5 CLOSED

dave-epstein commented on September 16, 2024

Getting "RuntimeError: CUDA error: out of memory" when trying to train on multiple GPUs

from blobgan.

Comments (5)

dave-epstein commented on September 16, 2024

It isn't clear from what you posted which batch size you're using, since 24 is commented out, but a 2080Ti only has around 20GB of memory, if I'm not mistaken. Try a batch size of 4 or 8 with gradient accumulation.

yimingsu01 commented on September 16, 2024

Sorry, I was being cryptic. I didn't specify the batch size; according to the log it is 22. Each 2080Ti has 11GB, so four of them should give me 44GB of memory. How can I enable gradient accumulation?

dave-epstein commented on September 16, 2024

That batch size is much too large for 11GB per GPU. GPU memory is not pooled in DDP training. Try 4 or 8. You can set trainer.accumulate_grad_batches to 2 or 4 for stability.
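
To make the arithmetic behind this advice concrete, here is a minimal sketch. It assumes the configured batch size is applied per GPU, which is the usual convention when each DDP process builds its own data loader; whether blobgan's config follows that convention is an assumption. Both helper functions are hypothetical and are not part of blobgan or PyTorch Lightning.

```python
def memory_available_to_each_replica(gpu_memories_gb):
    """Memory limit that one DDP replica has to fit into.

    DDP runs one process per GPU and every process holds a full copy of
    the model, so the binding limit is the smallest single card, not the
    sum over all cards.
    """
    return min(gpu_memories_gb)


def effective_batch_size(per_gpu_batch, num_gpus, accumulate_grad_batches=1):
    """Number of samples that contribute to one optimizer step.

    Assumes the batch size is per GPU (an assumption, see the lead-in).
    """
    return per_gpu_batch * num_gpus * accumulate_grad_batches


if __name__ == "__main__":
    cards = [11, 11, 11, 11]  # four 2080Ti cards, 11GB each
    print("sum of memory (not usable as one pool):", sum(cards), "GB")
    print("limit per replica:", memory_available_to_each_replica(cards), "GB")

    # Batch size 4 with 4 accumulation steps, or 8 with 2 steps
    print(effective_batch_size(4, num_gpus=4, accumulate_grad_batches=4))
    print(effective_batch_size(8, num_gpus=4, accumulate_grad_batches=2))
```

Under these assumptions, both suggested settings give 64 samples per optimizer step while each GPU only ever holds 4 or 8 samples at a time. In PyTorch Lightning the accumulation setting corresponds to the `accumulate_grad_batches` argument of `Trainer`; how it is passed to blobgan (for example as a config override) depends on how the run is launched.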

yimingsu01 commented on September 16, 2024

[2022-09-20 10:27:24,717][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 3
[2022-09-20 10:27:24,730][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2022-09-20 10:27:24,731][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

[2022-09-20 10:27:24,733][torch.distributed.distributed_c10d][INFO] - Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
[2022-09-20 10:27:24,734][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
[2022-09-20 10:27:24,740][torch.distributed.distributed_c10d][INFO] - Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.

For some reason the terminal got stuck here and seems to be frozen.

dave-epstein commented on September 16, 2024

Sorry, I'm not sure how I can help debug this without more information :) This is likely to be a GPU memory or other system issue.
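
One way to gather more information for a report like this is to record per-GPU memory usage while the run appears frozen. The sketch below is only a suggestion, not something from the thread: it assumes `nvidia-smi` is on the `PATH`, and the helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical.

```python
import shutil
import subprocess

# Ask nvidia-smi for one CSV row per GPU: index, used MiB, total MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows (values in MiB) into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        try:
            index, used, total = (int(field) for field in fields)
        except ValueError:
            continue  # skip rows with non-numeric values
        rows.append({"index": index, "used_mib": used, "total_mib": total})
    return rows


def query_gpu_memory():
    """Return per-GPU memory usage, or an empty list if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpu_memory():
        print(f"GPU {gpu['index']}: {gpu['used_mib']} / {gpu['total_mib']} MiB used")
```

Running this in a second terminal while training is stuck shows whether any card is close to its limit. If memory looks fine, relaunching with the environment variable `NCCL_DEBUG=INFO` makes NCCL print more detailed logs, which may help narrow down a communication problem.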
