Hey Soumith, I am training AlexNet on a single GPU (with NN backend). A single epoc

Too much time per epoch about imagenet-multigpu.torch HOT 6 CLOSED

soumith commented on August 11, 2024

Too much time per epoch

from imagenet-multigpu.torch.

Comments (6)

williford commented on August 11, 2024

The nn backend does not use the GPU. You need to either use the cunn or cudnn backend/library to use the GPU.

andyhahaha commented on August 11, 2024

Hi Soumith,
I have the same problem. I am training AlexNet on 4 GPUs, and it takes 18 hours per epoch. What is the normal training time per epoch? My training time is much longer than JakeRyan1's. Is that caused by the communication between GPUs?
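To compare runs across different setups, it can help to convert the epoch time into a throughput figure. The following is only a minimal sketch: it assumes one epoch covers the full ILSVRC-2012 training set (1,281,167 images), which may not match how an epoch is configured in a given run, and the function name `images_per_second` is made up for illustration.

```python
# Assumption: one epoch covers the ILSVRC-2012 training set (1,281,167 images).
# Adjust this if the run defines an epoch differently.
IMAGES_PER_EPOCH = 1_281_167


def images_per_second(hours_per_epoch, images_per_epoch=IMAGES_PER_EPOCH):
    """Average training throughput implied by a wall-clock epoch time."""
    return images_per_epoch / (hours_per_epoch * 3600.0)


# 18 hours per epoch, as reported above.
rate = images_per_second(18)
print(f"{rate:.1f} images/s")  # prints "19.8 images/s"
```

Under that assumption, 18 hours per epoch works out to roughly 20 images per second in total across all GPUs, a number that is easier to compare against other reports than hours per epoch.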

Viresh-R commented on August 11, 2024

Hi Andy, you can reduce the time per epoch by increasing the number of data loading threads (nDonkeys).

soumith commented on August 11, 2024

@JakeRyan1 the cudnn backend will be much, much faster.
@andyhahaha which GPU are you using?

andyhahaha commented on August 11, 2024

@soumith I use eight Tesla M40s.

soumith commented on August 11, 2024

@andyhahaha if you use 8 GPUs of any kind, you need to install them in a PCI-e topology that ensures all 8 GPUs map to the same CPU. It's likely that you didn't do this, so the resulting communication slowdown is probably the bottleneck. (This is not Torch-specific; you'll likely see it with any framework.)
There are two things you can do to improve your situation:

1. Install NCCL https://github.com/NVIDIA/nccl and nccl.torch https://github.com/ngimel/nccl.torch
2. Just use 4 GPUs at a time.
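
For the second option, one way to pick the 4 GPUs is to choose ones that sit under the same CPU. The sketch below is illustrative only: the `GPU_TO_SOCKET` mapping is hypothetical and would need to be filled in by hand from the output of `nvidia-smi topo -m` on the actual machine, and the helper name `gpus_on_one_socket` is made up. It then restricts the visible devices through the `CUDA_VISIBLE_DEVICES` environment variable.

```python
import os
from collections import defaultdict

# Hypothetical GPU index -> CPU socket mapping.
# Fill this in from `nvidia-smi topo -m` on the real machine.
GPU_TO_SOCKET = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}


def gpus_on_one_socket(gpu_to_socket, count):
    """Return `count` GPU indices that share a CPU socket, or raise ValueError."""
    by_socket = defaultdict(list)
    for gpu, socket in sorted(gpu_to_socket.items()):
        by_socket[socket].append(gpu)
    # Take the lowest-numbered socket that hosts enough GPUs.
    for socket in sorted(by_socket):
        if len(by_socket[socket]) >= count:
            return by_socket[socket][:count]
    raise ValueError(f"no CPU socket hosts {count} GPUs")


chosen = gpus_on_one_socket(GPU_TO_SOCKET, 4)
# CUDA_VISIBLE_DEVICES limits which GPUs a CUDA process started from here can see.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in chosen)
print(os.environ["CUDA_VISIBLE_DEVICES"])  # prints "0,1,2,3" for the mapping above
```

The same effect can be had by setting `CUDA_VISIBLE_DEVICES` directly in the shell before launching training; the helper only makes the "same CPU" constraint explicit.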
