Comments (14)
We have run into a similar problem: training with multiple GPUs is much slower than with a single GPU.
from imagenet-multigpu.torch.
OK, we found that the package contributors have already fixed the problem mentioned in this post.
Hi, sorry about that. Yes, it should be fixed in the latest commit.
Hi, I tested it, and the results still show that "nGPU=4" is slower than "nGPU=1". Do you have any comments? Thank you.
th main.lua -data ~/imagenet -nGPU 1 -batchSize 128
Epoch: [1][2/10000] Time 0.844 Err 6.9071 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][3/10000] Time 0.845 Err 6.9084 Top1-%: 0.00 LR 1e-02 DataLoadingTime 3.041
Epoch: [1][4/10000] Time 0.845 Err 6.9095 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.051
Epoch: [1][5/10000] Time 0.845 Err 6.9092 Top1-%: 0.00 LR 1e-02 DataLoadingTime 2.557
Epoch: [1][6/10000] Time 0.843 Err 6.9095 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.003
th main.lua -data ~/imagenet -nGPU 4 -batchSize 128
Epoch: [1][2/10000] Time 1.781 Err 6.9064 Top1-%: 0.78 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][3/10000] Time 1.765 Err 6.9066 Top1-%: 0.00 LR 1e-02 DataLoadingTime 2.181
Epoch: [1][4/10000] Time 1.761 Err 6.9080 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.004
Epoch: [1][5/10000] Time 1.760 Err 6.9089 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.699
Epoch: [1][6/10000] Time 1.763 Err 6.9058 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.004
th main.lua -data ~/imagenet -nGPU 4 -batchSize 256
Epoch: [1][2/10000] Time 2.479 Err 6.9081 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.004
Epoch: [1][3/10000] Time 2.421 Err 6.9074 Top1-%: 0.00 LR 1e-02 DataLoadingTime 3.012
Epoch: [1][4/10000] Time 2.369 Err 6.9066 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.107
Epoch: [1][5/10000] Time 2.368 Err 6.9078 Top1-%: 0.00 LR 1e-02 DataLoadingTime 3.725
Epoch: [1][6/10000] Time 2.368 Err 6.9079 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.005
@chienlinhuang1116 what commit hash are you on?
Thank you soumith, I am on the latest commit.
(1) You are right, the problem has already been fixed. When I work on AWS Ubuntu instances with the fbcunn libraries installed, it behaves correctly and multi-GPU is much faster than a single GPU.
(2) Because the latest imagenet examples do not seem to work with the fbcunn libraries, I tested on CentOS GPU machines with only Torch7 and the nn packages installed. In that case, multi-GPU is still slower than a single GPU.
In the case of (2), did you install Torch freshly? The latest nn / cunn plus the latest commit hash of this repo no longer has the slowness.
Hi, I installed Torch freshly using the following steps, but multi-GPU is still slower than a single GPU on the CentOS machines. Do you have any ideas?
Thank you.
curl -s https://raw.githubusercontent.com/torch/ezinstall/master/clean-old.sh | bash
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; ./install.sh
curl -s https://raw.githubusercontent.com/torch/ezinstall/master/install-luajit+torch | PREFIX=
Hi
I have exactly the same problem.
When I use nGPU=2, it is slower than just one GPU.
I installed the latest versions of Torch, cunn, and this repository.
Everything is up to date, but it is still slower.
Any ideas? @soumith @chienlinhuang1116 @buttomnutstoast
Here are my outputs:
for 1 GPU:
==> doing epoch on training data:
==> online epoch # 1
Epoch: [1][1/9000] Time 0.826 Err 3.8455 Top1-%: 3.12 Topn-%: 7.81 LR 1e-02 DataLoadingTime 5.466
Epoch: [1][2/9000] Time 0.709 Err 3.6204 Top1-%: 17.97 Topn-%: 36.72 LR 1e-02 DataLoadingTime 0.072
Epoch: [1][3/9000] Time 0.701 Err 3.1000 Top1-%: 30.47 Topn-%: 65.62 LR 1e-02 DataLoadingTime 0.009
Epoch: [1][4/9000] Time 0.719 Err 2.7807 Top1-%: 27.34 Topn-%: 61.72 LR 1e-02 DataLoadingTime 0.005
Epoch: [1][5/9000] Time 0.687 Err 2.8056 Top1-%: 25.78 Topn-%: 63.28 LR 1e-02 DataLoadingTime 1.749
Epoch: [1][6/9000] Time 0.715 Err 3.1090 Top1-%: 19.53 Topn-%: 55.47 LR 1e-02 DataLoadingTime 0.005
Epoch: [1][7/9000] Time 0.719 Err 2.7177 Top1-%: 25.78 Topn-%: 65.62 LR 1e-02 DataLoadingTime 0.011
Epoch: [1][8/9000] Time 0.676 Err 2.9563 Top1-%: 17.97 Topn-%: 60.16 LR 1e-02 DataLoadingTime 0.010
for two GPUs:
==> doing epoch on training data:
==> online epoch # 1
Epoch: [1][1/9000] Time 4.474 Err 3.8503 Top1-%: 1.56 Topn-%: 9.38 LR 1e-02 DataLoadingTime 6.425
Epoch: [1][2/9000] Time 2.692 Err 3.5693 Top1-%: 23.44 Topn-%: 42.97 LR 1e-02 DataLoadingTime 0.005
Epoch: [1][3/9000] Time 2.539 Err 3.2223 Top1-%: 28.12 Topn-%: 57.81 LR 1e-02 DataLoadingTime 0.022
Epoch: [1][4/9000] Time 2.511 Err 3.0643 Top1-%: 25.78 Topn-%: 57.81 LR 1e-02 DataLoadingTime 0.019
Epoch: [1][5/9000] Time 2.500 Err 2.8987 Top1-%: 31.25 Topn-%: 60.16 LR 1e-02 DataLoadingTime 0.024
Epoch: [1][6/9000] Time 2.497 Err 3.2392 Top1-%: 23.44 Topn-%: 55.47 LR 1e-02 DataLoadingTime 0.020
Epoch: [1][7/9000] Time 2.494 Err 2.8436 Top1-%: 21.88 Topn-%: 63.28 LR 1e-02 DataLoadingTime 0.023
Epoch: [1][8/9000] Time 2.499 Err 2.7006 Top1-%: 22.66 Topn-%: 65.62 LR 1e-02 DataLoadingTime 0.015
Epoch: [1][9/9000] Time 2.493 Err 2.9153 Top1-%: 17.19 Topn-%: 60.16 LR 1e-02 DataLoadingTime 0.021
Epoch: [1][10/9000] Time 2.503 Err 2.7242 Top1-%: 21.09 Topn-%: 62.50 LR 1e-02 DataLoadingTime 0.019
In my experience, you can try:
- use multiple threads in DataParallelTable
- call model:getParameters() before model:forward()
- check that NCCL is installed correctly
- maybe hardware issues...
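To make the first two points concrete, here is a minimal sketch (not code from this repo) of a multi-threaded DataParallelTable setup; `net` and the GPU ids are assumptions, and it requires at least two CUDA devices:

```lua
require 'cunn'

-- Split minibatches on dim 1; flatten parameters and use NCCL if available.
local dpt = nn.DataParallelTable(1, true, true)
   :add(net, {1, 2})          -- replicate the assumed model `net` onto GPUs 1 and 2
   :threads(function()
      require 'cudnn'         -- each worker thread must load its own modules
   end)
dpt:cuda()

-- Flatten parameters once, before the first forward pass, so the
-- replicas can share storage efficiently during synchronization.
local params, gradParams = dpt:getParameters()
```

The `:threads(...)` call is what moves the per-GPU forward/backward work off the main thread, which is usually where the single-threaded slowness comes from.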
@buttomnutstoast
Thanks for your answer.
I didn't have NCCL, so I installed it.
The installation seems fine.
Then I replaced util.lua:10 with this:
model = nn.DataParallelTable(1):threads(function()
   require 'cudnn'
end)
Nothing changed in this case; it is still very slow.
Then I tried to enable NCCL like this:
model = nn.DataParallelTable(1, true, true):threads(function()
   require 'cudnn'
end)
But this time the program just stops and does nothing!
Any ideas?
Did you install NCCL from its C++ source? The module installed from luarocks is just a Lua interface to the library.
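For reference, a from-source setup might look roughly like this. This is a sketch under assumptions: the exact make variables, install targets, and paths vary by NCCL version and machine, so check the NCCL README for your release.

```shell
# Build the NCCL shared library from source (assumed CUDA path).
git clone https://github.com/NVIDIA/nccl.git
cd nccl
make CUDA_HOME=/usr/local/cuda
sudo make install          # installs libnccl.so; PREFIX may need adjusting
sudo ldconfig              # refresh the linker cache so Lua can find the .so

# Then install the Lua binding, which only wraps the shared library above.
luarocks install nccl
```

If the binding is installed but the shared library is missing or not on the linker path, DataParallelTable's NCCL path can fail in confusing ways rather than report a clean error.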
Yes, I installed the C++ source.
I also tried enabling NCCL in two steps, but there was no difference.
model = nn.DataParallelTable(1, true, true)
model:threads(function()
   require 'cudnn'
end)
It just does nothing; it seems to be stuck in a deadlock.
I had an old installation of Torch, and it seems that it was causing the problem.
When I removed it, everything worked fine.
Thanks!