Comments (12)
@compete369 For BytePS, can you please try `export MXNET_OMP_MAX_THREADS=10` for the servers?
from byteps.
There are a few things you can try. If any of the following works for you, please let us know. Though the following env vars start with `MXNET`, they apply to any worker, whether TF/MXNet/PyTorch, because the parameter server is based on MXNet.

- For the parameter servers, set `export MXNET_OMP_MAX_THREADS=10` if you have 16 CPU cores per server. Set `export MXNET_OMP_MAX_THREADS=4` if you only have 8 CPU cores.
- Set `export MXNET_CPU_WORKER_NTHREADS=32`. This may speed up the parameter server.
- Start more parameter server instances. For example, when you have two physical machines to run the servers, you can start 4 (`DMLC_NUM_SERVER=4`), i.e. two server instances per physical machine. This increases network bandwidth utilization, especially when a single TCP flow cannot saturate your bandwidth.
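Putting the suggestions above together, the environment for one server instance might look like the sketch below. This is only an illustration under the assumptions stated in the comments (16-core server machines, 4 server instances total); tune the values for your hardware.

```shell
# Sketch of the environment for one BytePS parameter-server instance,
# assuming 16 CPU cores per server machine (values from this thread).
export DMLC_ROLE=server
export DMLC_NUM_SERVER=4             # 4 server instances total, e.g. 2 per machine
export MXNET_OMP_MAX_THREADS=10      # use 4 instead if the server has only 8 cores
export MXNET_CPU_WORKER_NTHREADS=32  # optional; lower it if server CPU saturates
echo "role=$DMLC_ROLE omp=$MXNET_OMP_MAX_THREADS workers=$MXNET_CPU_WORKER_NTHREADS"
```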
By "crazy" I mean the CPU usage goes very high, about 100%, while GPU utilization drops to 2-10%. Thanks very much for your instructions.
@compete369 That sounds like your CPU becomes the bottleneck. Perhaps you can reduce `MXNET_CPU_WORKER_NTHREADS` to 16 or an even smaller value. It requires some tuning.
Hello, I followed your advice except for `export MXNET_CPU_WORKER_NTHREADS=32`, and got 4605 imgs/sec in total with 2 workers (8 GPUs, 64 cores, 256 GB memory each) and 4 servers (16 cores, 16 GB memory each). Thanks very much!
When I included `export MXNET_CPU_WORKER_NTHREADS=32`, the servers went crazy, so I dropped it.
Two quick questions:
- When the test reaches the end, there is an exception:
Iter #99: 289.7 img/sec per GPU
Traceback (most recent call last):
  File "/usr/local/byteps/example/pytorch/benchmark_byteps.py", line 132, in
    raise Exception
Exception
(the same traceback is printed by each GPU process; the interleaved copies are deduplicated above)
- Do you know how to analyze the NCCL rings? I am wondering whether the rings make use of NVLink correctly.
worker-pytorch-0:45:45 [7] NCCL INFO Ring 00 : 3[7] -> 0[4] via P2P/IPC
worker-pytorch-0:42:42 [5] NCCL INFO Ring 00 : 1[5] -> 2[6] via P2P/IPC
worker-pytorch-0:46:46 [6] NCCL INFO Ring 00 : 2[6] -> 3[7] via P2P/IPC
worker-pytorch-0:41:41 [4] NCCL INFO Ring 00 : 0[4] -> 1[5] via P2P/IPC
worker-pytorch-0:41:41 [4] NCCL INFO Ring 01 : 0[4] -> 2[6] via P2P/IPC
worker-pytorch-0:42:42 [5] NCCL INFO Ring 01 : 1[5] -> 3[7] via P2P/IPC
worker-pytorch-0:45:45 [7] NCCL INFO Ring 01 : 3[7] -> 0[4] via P2P/IPC
worker-pytorch-0:46:46 [6] NCCL INFO Ring 01 : 2[6] -> 1[5] via P2P/IPC
worker-pytorch-0:41:41 [4] NCCL INFO Ring 02 : 0[4] -> 3[7] via P2P/IPC
worker-pytorch-0:46:46 [6] NCCL INFO Ring 02 : 2[6] -> 0[4] via P2P/IPC
worker-pytorch-0:45:45 [7] NCCL INFO Ring 02 : 3[7] -> 1[5] via P2P/IPC
worker-pytorch-0:42:42 [5] NCCL INFO Ring 02 : 1[5] -> 2[6] via P2P/IPC
worker-pytorch-0:46:46 [6] NCCL INFO Ring 03 : 2[6] -> 1[5] via P2P/IPC
worker-pytorch-0:42:42 [5] NCCL INFO Ring 03 : 1[5] -> 0[4] via P2P/IPC
worker-pytorch-0:41:41 [4] NCCL INFO Ring 03 : 0[4] -> 3[7] via P2P/IPC
worker-pytorch-0:45:45 [7] NCCL INFO Ring 03 : 3[7] -> 2[6] via P2P/IPC
worker-pytorch-0:45:45 [7] NCCL INFO Ring 04 : 3[7] -> 0[4] via P2P/IPC
worker-pytorch-0:42:42 [5] NCCL INFO Ring 04 : 1[5] -> 2[6] via P2P/IPC
worker-pytorch-0:41:41 [4] NCCL INFO Ring 04 : 0[4] -> 1[5] via P2P/IPC
worker-pytorch-0:46:46 [6] NCCL INFO Ring 04 : 2[6] -> 3[7] via P2P/IPC
worker-pytorch-0:46:46 [6] NCCL INFO Ring 05 : 2[6] -> 1[5] via P2P/IPC
worker-pytorch-0:41:41 [4] NCCL INFO Ring 05 : 0[4] -> 2[6] via P2P/IPC
worker-pytorch-0:45:45 [7] NCCL INFO Ring 05 : 3[7] -> 0[4] via P2P/IPC
worker-pytorch-0:42:42 [5] NCCL INFO Ring 05 : 1[5] -> 3[7] via P2P/IPC
worker-pytorch-0:42:42 [5] NCCL INFO Ring 06 : 1[5] -> 2[6] via P2P/IPC
worker-pytorch-0:41:41 [4] NCCL INFO Ring 06 : 0[4] -> 3[7] via P2P/IPC
worker-pytorch-0:45:45 [7] NCCL INFO Ring 06 : 3[7] -> 1[5] via P2P/IPC
worker-pytorch-0:46:46 [6] NCCL INFO Ring 06 : 2[6] -> 0[4] via P2P/IPC
worker-pytorch-0:46:46 [6] NCCL INFO Ring 07 : 2[6] -> 1[5] via P2P/IPC
worker-pytorch-0:45:45 [7] NCCL INFO Ring 07 : 3[7] -> 2[6] via P2P/IPC
worker-pytorch-0:42:42 [5] NCCL INFO Ring 07 : 1[5] -> 0[4] via P2P/IPC
worker-pytorch-0:41:41 [4] NCCL INFO Ring 07 : 0[4] -> 3[7] via P2P/IPC
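One quick way to sanity-check logs like the above is to count which transport NCCL chose per hop: every intra-node hop should report `P2P/IPC` (NVLink or PCIe peer-to-peer) rather than `SHM` or `NET`. A hypothetical grep over a saved log (the here-doc below inlines sample lines; in real use you would pipe `grep 'NCCL INFO Ring' train.log` instead):

```shell
# Summarize NCCL transports per hop from NCCL INFO Ring lines.
transports=$(cat <<'EOF' | sed -n 's/.*via \(.*\)$/\1/p' | sort | uniq -c
worker-pytorch-0:45:45 [7] NCCL INFO Ring 00 : 3[7] -> 0[4] via P2P/IPC
worker-pytorch-0:42:42 [5] NCCL INFO Ring 00 : 1[5] -> 2[6] via P2P/IPC
worker-pytorch-0:46:46 [6] NCCL INFO Ring 00 : 2[6] -> 3[7] via P2P/IPC
EOF
)
echo "$transports"   # a single "P2P/IPC" bucket means all hops use peer-to-peer
```

Note that `P2P/IPC` covers both NVLink and PCIe peer-to-peer; to confirm NVLink specifically you still need the `nvidia-smi nvlink` counters.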
@compete369 Good to know you got a performance improvement.
- We fixed the exception in b825042. The code in our docker images is stale, though. We will update the images. For now you can manually update the code and rebuild.
- Perhaps take a look at `nvidia-smi nvlink -sc`. This might be helpful.
> When I included `export MXNET_CPU_WORKER_NTHREADS=32`, the servers went crazy, so I dropped it.

Besides, what does "crazy" mean? Does it mean bad performance?
Doesn't really help me much
@spgeaney113 @Bama4542 If you have specific questions, please open new issues. You are only spamming this thread now.
How did you test the performance reported on the main page: synthetic data, or real ImageNet on the NAS? I tested Horovod with 32 GPUs, and the performance dropped 20% (8300 -> 6477).
@compete369 We used synthetic data in the performance report.
Could you share which public cloud you relied on, if possible? Just curious about the good network stability and performance. Thanks!
Related Issues (20)
- How to use gradient accumulate in BytePS torch DDP? HOT 5
- The byteps in K8S Pod doesn't have DMLC_WORKER_ID configured.
- Stuck in the bps.init(). HOT 7
- Is it right to do allreduce immediately for non-zero ranks in bytescheduler? HOT 2
- When will sparse models be supported?
- Is there a plan to support CPU-only training? Our workers also use CPU machines HOT 2
- benchmark with cross barrier error
- Successfully installed BytePS but cannot import byteps.torch or byteps.tensorflow HOT 2
- Running multiple workers on a single GPU machine
- Release BytePS docker image support for TF2
- Installation error HOT 1
- Communication failure in MXNet with BytePS HOT 3
- support for fault tolerance and straggler mitigation
- broadcast and is_initialized api are not supported with pytorch.
- Supported environment
- Installation problem
- Mistakes of Workload calculation HOT 5
- How does the tensorflow scheduler plugin used in the tf_benchmark_cnn.py HOT 1
- segmentation fault while launching the worker HOT 1
- Is there any benchmark comparison with Megatron-LM ?