Comments (6)
The nn backend does not use the GPU. You need to either use the cunn or cudnn backend/library to use the GPU.
from imagenet-multigpu.torch.
Hi Soumith,
I have the same problem. I am training AlexNet on 4 GPUs, and it takes 18 hours per epoch. What is the normal training time per epoch? My training time is much longer than JakeRyan1's. Is that caused by the communication between GPUs?
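To put the reported epoch time in perspective, it can be converted to an average throughput. The sketch below assumes one epoch is a full pass over the ILSVRC-2012 training set (1,281,167 images); the thread does not state the dataset, and the function name is only illustrative.

```python
# Assumption: one epoch is a full pass over the ILSVRC-2012 training set.
# The thread itself does not say which dataset is being used.
IMAGENET_TRAIN_IMAGES = 1_281_167


def images_per_second(hours_per_epoch, images_per_epoch=IMAGENET_TRAIN_IMAGES):
    """Average training throughput implied by a given epoch time."""
    return images_per_epoch / (hours_per_epoch * 3600.0)


# 18 hours per epoch is the figure reported in the comment above.
reported = images_per_second(18)
print(f"{reported:.1f} images/s")
```

Under that assumption the result is roughly 20 images per second across all GPUs, which is consistent with the replies pointing at data loading or inter-GPU communication rather than raw compute.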
Hi Andy, you can reduce the time per epoch by increasing the number of data-loading threads (nDonkeys).
@JakeRyan1 the cudnn backend will be much, much faster.
@andyhahaha which GPU are you using?
@soumith I use eight Tesla M40s.
@andyhahaha if you use 8 GPUs of any kind, you need to install them in a PCI-e topology that makes sure all 8 GPUs map to the same CPU. It's likely that you didn't do this, so the resulting communication slowdown is probably your bottleneck. (This is not Torch-specific; you'll likely see it in any framework.)
There are two things you can do to improve your situation:
- Install NCCL https://github.com/NVIDIA/nccl and nccl.torch https://github.com/ngimel/nccl.torch
- Just use 4 GPUs at a time.
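As a rough illustration of the "4 GPUs at a time" suggestion, the sketch below picks a set of GPUs that all hang off the same CPU and formats them for the CUDA_VISIBLE_DEVICES environment variable. The GPU-to-CPU mapping here is hypothetical; on a real machine it would have to be read from a tool such as `nvidia-smi topo -m`, and the helper name is an assumption made for this example.

```python
from collections import defaultdict


def visible_devices_for_largest_group(gpu_to_cpu, max_gpus=4):
    """Return a CUDA_VISIBLE_DEVICES string of GPUs that share one CPU.

    gpu_to_cpu maps a GPU index to the index of the CPU socket it is
    attached to. The largest same-CPU group is chosen (the lowest CPU
    index wins ties) and truncated to at most max_gpus devices.
    """
    groups = defaultdict(list)
    for gpu, cpu in sorted(gpu_to_cpu.items()):
        groups[cpu].append(gpu)
    # max() returns the first maximal element, so iterating CPUs in
    # sorted order makes the tie-break deterministic.
    best_cpu = max(sorted(groups), key=lambda cpu: len(groups[cpu]))
    return ",".join(str(gpu) for gpu in groups[best_cpu][:max_gpus])


# Hypothetical 8-GPU machine: GPUs 0-3 on CPU 0, GPUs 4-7 on CPU 1.
example_topology = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
print(visible_devices_for_largest_group(example_topology))
```

Exporting the resulting string as CUDA_VISIBLE_DEVICES before launching training would restrict the process to GPUs behind a single CPU, which avoids the cross-CPU transfers described above.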