
gpcnet's People

Contributors

h4u5, justsz, mendygral


gpcnet's Issues

Linking results in multiple definition errors

The build of GPCNET results in multiple definitions of the symbols table_outerbar, table_innerbar, and print_buffer:

$ make clean
rm -f *.o
rm -f network_test
rm -f network_load_test
$ make
cc -c -o network_test.o network_test.c -I . 
cc -c -o random_ring.o random_ring.c -I . 
cc -c -o collectives.o collectives.c -I . 
cc -c -o subcomms.o subcomms.c -I . 
cc -c -o utils.o utils.c -I . 
cc -o network_test utils.o random_ring.o collectives.o subcomms.o network_test.o -I .  -lm
/usr/bin/ld: random_ring.o:(.bss+0x0): multiple definition of `table_outerbar'; utils.o:(.bss+0x0): first defined here
/usr/bin/ld: random_ring.o:(.bss+0x60): multiple definition of `table_innerbar'; utils.o:(.bss+0x60): first defined here
/usr/bin/ld: random_ring.o:(.bss+0xc0): multiple definition of `print_buffer'; utils.o:(.bss+0xc0): first defined here
...

I suggest the following changes:

$ diff network_test.h.orig network_test.h
34c34
< char table_outerbar[TBLSIZE+1], table_innerbar[TBLSIZE+1], print_buffer[TBLSIZE+1];
---
> extern char table_outerbar[TBLSIZE+1], table_innerbar[TBLSIZE+1], print_buffer[TBLSIZE+1];
$ diff utils.c.orig utils.c
21a22,23
> char table_outerbar[TBLSIZE+1], table_innerbar[TBLSIZE+1], print_buffer[TBLSIZE+1];
> 
$ make clean; make
rm -f *.o
rm -f network_test
rm -f network_load_test
cc -c -o network_test.o network_test.c -I . 
cc -c -o random_ring.o random_ring.c -I . 
cc -c -o collectives.o collectives.c -I . 
cc -c -o subcomms.o subcomms.c -I . 
cc -c -o utils.o utils.c -I . 
cc -o network_test utils.o random_ring.o collectives.o subcomms.o network_test.o -I .  -lm
$
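
For context, these multiple-definition errors typically surface with newer toolchains: GCC 10 and later default to -fno-common, so the tentative definitions that each object file picks up from network_test.h are no longer silently merged by the linker. The diff above applies the usual C pattern for a global shared across translation units, which in outline looks like this:

/* network_test.h: declaration only, no storage is allocated here */
extern char table_outerbar[TBLSIZE+1], table_innerbar[TBLSIZE+1], print_buffer[TBLSIZE+1];

/* utils.c: exactly one translation unit owns the definition */
char table_outerbar[TBLSIZE+1], table_innerbar[TBLSIZE+1], print_buffer[TBLSIZE+1];

/* every other .c file just includes network_test.h and uses the buffers */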

Trying to understand gpcnet output

Hello,

I got the table below after running network_test.

I have two questions:

  1. What is the meaning of the Avg(Worst) column?
  2. How is it possible for the Multiple Allreduce 99% and 99.9% percentile values to fall outside the min-max range? (see the toy illustration after the table)

Kind regards,

Lucian Anton

Network Tests v1.3
  Test with 14320 MPI ranks (1790 nodes)

  Legend
   RR = random ring communication pattern
   Nat = natural ring communication pattern
   Lat = latency
   BW = bandwidth
   BW+Sync = bandwidth with barrier
+------------------------------------------------------------------------------------------------------------------------------------------+
|                                                          Isolated Network Tests                                                          |
+---------------------------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
|                            Name |          Min |          Max |          Avg |   Avg(Worst) |          99% |        99.9% |        Units |
+---------------------------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
|          RR Two-sided Lat (8 B) |          1.2 |         22.2 |          1.5 |          4.7 |          3.6 |          5.1 |         usec |
+---------------------------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
|                RR Get Lat (8 B) |          1.3 |         22.3 |          1.9 |          3.7 |          2.2 |          3.6 |         usec |
+---------------------------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
|      RR Two-sided BW (131072 B) |        549.7 |       3015.1 |       1199.2 |        764.5 |        460.4 |        335.0 |   MiB/s/rank |
+---------------------------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
|            RR Put BW (131072 B) |          7.4 |      22134.8 |       2598.8 |          7.4 |          0.9 |          0.9 |   MiB/s/rank |
+---------------------------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| RR Two-sided BW+Sync (131072 B) |        336.2 |       2031.9 |        916.5 |        769.7 |        335.5 |        186.9 |   MiB/s/rank |
+---------------------------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
|     Nat Two-sided BW (131072 B) |        650.0 |       4913.7 |       1899.5 |       1124.1 |       1142.5 |        883.4 |   MiB/s/rank |
+---------------------------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
|        Multiple Allreduce (8 B) |         37.3 |         78.3 |         45.5 |         78.3 |        113.3 |        999.9 |         usec |
+---------------------------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
|      Multiple Alltoall (4096 B) |        838.9 |       1003.9 |        901.6 |        838.9 |        479.3 |        186.3 |   MiB/s/rank |
+---------------------------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
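
Regarding question 2, one purely speculative explanation (I have not checked how GPCNeT actually aggregates these columns): if Min/Max/Avg are taken over per-rank averages while the 99%/99.9% columns are percentiles over all raw samples, then a heavy tail on a few ranks can push the tail percentiles well outside the min-max range of the averages. A toy illustration in plain C:

#include <stdio.h>
#include <stdlib.h>

#define RANKS   4
#define SAMPLES 1000

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    static double all[RANKS * SAMPLES];
    double min_avg = 1e30, max_avg = 0.0;

    for (int r = 0; r < RANKS; r++) {
        double sum = 0.0;
        for (int s = 0; s < SAMPLES; s++) {
            /* mostly ~40-49 usec, but rank 0 sees a few 1000 usec spikes */
            double lat = 40.0 + (s % 10);
            if (r == 0 && s % 200 == 0)
                lat = 1000.0;
            all[r * SAMPLES + s] = lat;
            sum += lat;
        }
        double avg = sum / SAMPLES;
        if (avg < min_avg) min_avg = avg;
        if (avg > max_avg) max_avg = avg;
    }

    qsort(all, RANKS * SAMPLES, sizeof(double), cmp_double);
    double p999 = all[(int)(0.999 * RANKS * SAMPLES)];

    /* min/max of the per-rank averages stay below 50 usec,
       but the 99.9th percentile of the raw samples is 1000 usec */
    printf("min avg %.1f  max avg %.1f  99.9%% of samples %.1f\n",
           min_avg, max_avg, p999);
    return 0;
}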

Running gpcnet on ARM with EFA

Hello all,

We are working to run gpcnet on the new ARM offerings for AWS, using EFA, and are getting segmentation faults in the congestor portion of the tests. First question: has anyone successfully compiled and run the benchmarks on ARM, and if so, can they share any lessons learned? Second, any tips on where to start when looking into the segfaults for just the congestion portion?

Thanks!
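
Not gpcnet-specific, but one low-effort starting point on Linux/glibc is to rebuild with -g and install a SIGSEGV handler that dumps a raw backtrace from whichever rank faults; a generic sketch (untested on EFA or Graviton, so treat it only as a starting point):

#include <execinfo.h>   /* backtrace(), backtrace_symbols_fd() */
#include <signal.h>
#include <unistd.h>

/* dump a raw backtrace to stderr when the process segfaults, then exit;
   link with -rdynamic so function names appear in the output */
static void segv_handler(int sig)
{
    void *frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(128 + sig);
}

int main(void)
{
    signal(SIGSEGV, segv_handler);   /* in gpcnet this could go right after MPI_Init() */

    volatile int *p = NULL;          /* deliberate fault to demonstrate the handler */
    return *p;
}

From there, running the single faulting rank under gdb usually narrows things down further.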

Set minimum time for benchmark

I tried to do this by increasing the latency iteration count, but malloc fails for larger counts (> 10,000,000 latency iterations):

Failed to allocate perf_vals in random_ring()
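
For what it's worth, a rough sizing estimate, under the assumption (not verified against random_ring()) that perf_vals stores one 8-byte value per latency iteration per rank:

#include <stdio.h>

int main(void)
{
    /* assumed: one double per latency iteration, per rank (not verified) */
    double iters          = 10000000.0;            /* count at which malloc starts failing */
    double bytes_per_buf  = iters * 8.0;           /* ~76 MiB per buffer */
    double ranks_per_node = 64.0;                  /* illustrative PPN */

    printf("per buffer : %.1f MiB\n", bytes_per_buf / (1024.0 * 1024));
    printf("per node   : %.1f GiB at %g ranks/node\n",
           ranks_per_node * bytes_per_buf / (1024.0 * 1024 * 1024),
           ranks_per_node);
    return 0;
}

A single buffer is modest, but multiplied by the number of such buffers and the ranks per node it can plausibly exhaust node memory, which would explain the allocation failure.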

GPCNeT results on small 32 node cluster

Dear colleagues,

Thank you for the very interesting article about GPCNeT presented at SC19!
In my opinion, GPCNeT looks like a good attempt to fill the existing gap in congestion-control studies of HPC networking.

I'm not sure whether GitHub is the right place for asking questions about GPCNeT, but why not try?

Since congestion control is also one of my personal research interests, I decided to evaluate GPCNeT on a typical small cluster with 32 nodes:

  • Intel Xeons, 18 cores @ 2.30 GHz
  • ConnectX-4 EDR, 100 Gb/s
  • 36-port Mellanox SB7700, 7.2 Tb/s of backplane bandwidth
  • OpenMPI 4.0.3 + UCX

I ran network_load_test using 28 of the 32 nodes in several scenarios and got the results presented below:

  • 20% vs 80% proportion of canaries and congestors, 4 congestors, default message sizes: no congestion (here I refer to the congestion impact metric for both average and tail latency)
  • 50% vs 50%, 4 congestors, default message sizes: no congestion
  • 20% vs 80%, 1 congestor, default message sizes: no congestion, whichever of the available congestors I switch to
  • 50% vs 50%, 1 congestor, default message sizes: no congestion, whichever of the available congestors I switch to
  • The same picture when I change the congestors' message size.
    These tests were done both at 18 PPN and 36 PPN (Hyper-Threading).

I'm curious why there is no congestion impact in any of these scenarios (apart from some random noise from time to time). I came up with several hypotheses:

  • Even assuming the MPI ranks on the hosts utilize the full 100 Gb/s link capacity, there is plenty of headroom in the switch buffers to process and forward packets fast enough, since I used 28 of the 32 available nodes and 4 switch ports are unused (see the rough arithmetic after this list). So the recommendation from the README is not satisfied: "network_load_test should not be run at much less than full system scale (ie, run on at least 95% of system nodes)"
  • The system scale is not big enough. But if so, what is the baseline system scale for InfiniBand at which we could see a congestion impact in GPCNeT?
  • For some strange reason the MPI ranks aren't able to push enough traffic into the network?
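
A rough back-of-the-envelope check on the first hypothesis (my own numbers, not from the GPCNeT paper):

  injection from the 28 canary + congestor nodes: 28 x 100 Gb/s       = 2.8 Tb/s
  SB7700 backplane: 36 ports x 100 Gb/s x 2 directions                = 7.2 Tb/s

In other words, a single 36-port EDR switch is non-blocking for its own ports, so whatever pattern the 28 nodes generate can be forwarded at line rate; the congestion GPCNeT is meant to expose would more plausibly come from oversubscribed or shared links in a multi-switch fabric.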

In my opinion, the first hypothesis (headroom in the switch) is the case here.

What do you think? Maybe there have been attempts to run GPCNeT on small clusters that are not mentioned in the paper?

In any case I would be grateful for any discussion or explanation :-)

Mikhail
