openpmix / prrte
PMIx Reference RunTime Environment (PRRTE)
Home Page: https://pmix.org
License: Other
Thank you for taking the time to submit an issue!
PRRTE: master @ ffe3dd3
PMIx: external, master @ 3f81378fc76c12c6564c2fce2c69608a286a1707
Install method: git clone (with external libevent, PMIx, enable-debug)
iof_hnp finalizes before all processes reach IOF COMPLETE; the read handlers of some processes are not released when the HNP starts to finalize.
$prte -pmca pmix ext4x -pmca routed direct -pmca pmix_base_verbose 2 -pmca iof_base_verbose 10 -pmca state_base_verbose 10 -debug-daemons
Running an MPI application compiled with mpicc. OMPI and PRRTE use the same external PMIx and libevent.
I included verbose output from iof and state. Sorry about the volume of information, but I think it is helpful.
[saturn.icl.utk.edu:81156] [[6128,0],0] orted_cmd: received add_local_procs
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pulling fd 38 for process [[6128,14],0]
[saturn.icl.utk.edu:81156] defining endpt: file iof_hnp.c line 366 fd 38
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 32 for process [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],0]: iof_hnp.c 187
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 39 for process [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],0]: iof_hnp.c 190
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 41 for process [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],1]: iof_hnp.c 187
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 43 for process [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],1]: iof_hnp.c 190
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 45 for process [[6128,14],2]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],2]: iof_hnp.c 187
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 47 for process [[6128,14],2]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],2]: iof_hnp.c 190
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 49 for process [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],3]: iof_hnp.c 187
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 51 for process [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],3]: iof_hnp.c 190
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],0] STATE RUNNING AT base/odls_base_default_fns.c:1185
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],0] STATE RUNNING PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],1] STATE RUNNING AT base/odls_base_default_fns.c:1185
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],1] STATE RUNNING PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],2] STATE RUNNING AT base/odls_base_default_fns.c:1185
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],2] STATE RUNNING PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE RUNNING AT base/odls_base_default_fns.c:1185
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE RUNNING PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],0] state RUNNING
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],1] state RUNNING
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],2] state RUNNING
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state RUNNING
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE JOB [6128,14] STATE RUNNING AT base/state_base_fns.c:683
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING JOB [6128,14] STATE RUNNING PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 0 for process [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],0] STATE SYNC REGISTERED AT orted/pmix/pmix_server_gen.c:89
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],0] STATE SYNC REGISTERED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],0] state SYNC REGISTERED
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],1] STATE SYNC REGISTERED AT orted/pmix/pmix_server_gen.c:89
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],1] STATE SYNC REGISTERED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],1] state SYNC REGISTERED
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE SYNC REGISTERED AT orted/pmix/pmix_server_gen.c:89
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE SYNC REGISTERED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],2] STATE SYNC REGISTERED AT orted/pmix/pmix_server_gen.c:89
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],2] STATE SYNC REGISTERED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state SYNC REGISTERED
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],2] state SYNC REGISTERED
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE JOB [6128,14] STATE SYNC REGISTERED AT base/state_base_fns.c:693
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING JOB [6128,14] STATE SYNC REGISTERED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE JOB [6128,14] STATE READY FOR DEBUGGERS AT base/plm_base_launch_support.c:731
[saturn.icl.utk.edu:81156] ACTIVATE: ANY STATE NOT FOUND
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],2] of size 31 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 31 bytes from stdout of [[6128,14],2]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],0] of size 31 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 31 bytes from stdout of [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],1] of size 146 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 146 bytes from stdout of [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],3] of size 31 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 31 bytes from stdout of [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],0] of size 116 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 116 bytes from stdout of [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],3] of size 116 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 116 bytes from stdout of [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],2] of size 346 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 346 bytes from stdout of [[6128,14],2]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],1] of size 346 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 346 bytes from stdout of [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],0] of size 231 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 231 bytes from stdout of [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],3] of size 243 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 243 bytes from stdout of [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp:read handler [[6128,14],3] Error on connection:49
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],3] of size 0 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 0 bytes from stdout of [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],3] of size 0 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 0 bytes from stderr of [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE IOF COMPLETE AT iof_hnp_read.c:328
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE IOF COMPLETE PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state IOF COMPLETE
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE ABORTED BY SIGNAL AT base/odls_base_default_fns.c:1897
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE ABORTED BY SIGNAL PRI 0
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE WAITPID FIRED AT errmgr_default_hnp.c:647
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE WAITPID FIRED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state WAITPID FIRED
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE NORMALLY TERMINATED AT base/state_base_fns.c:715
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE NORMALLY TERMINATED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state NORMALLY TERMINATED
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE NORMALLY TERMINATED AT errmgr_default_hnp.c:206
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE NORMALLY TERMINATED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:cleanup_node on proc [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE ABORTED BY SIGNAL AT base/plm_base_receive.c:352
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE ABORTED BY SIGNAL PRI 0
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE WAITPID FIRED AT errmgr_default_hnp.c:647
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE WAITPID FIRED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state NORMALLY TERMINATED
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:cleanup_node on proc [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state WAITPID FIRED
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE NORMALLY TERMINATED AT base/state_base_fns.c:715
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE NORMALLY TERMINATED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state NORMALLY TERMINATED
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:cleanup_node on proc [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],0] of size 97 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 97 bytes from stdout of [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],1] of size 106 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 106 bytes from stdout of [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],2] of size 106 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 106 bytes from stdout of [[6128,14],2]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],0] of size 97 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 97 bytes from stdout of [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp:read handler [[6128,14],1] Error on connection:41
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],1] of size 0 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 0 bytes from stdout of [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],1] of size 0 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 0 bytes from stderr of [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],1] STATE IOF COMPLETE AT iof_hnp_read.c:328
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],1] STATE IOF COMPLETE PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],1] state IOF COMPLETE
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],1] STATE WAITPID FIRED AT base/odls_base_default_fns.c:1897
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],1] STATE WAITPID FIRED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],1] state WAITPID FIRED
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],1] STATE NORMALLY TERMINATED AT base/state_base_fns.c:715
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],1] STATE NORMALLY TERMINATED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],1] state NORMALLY TERMINATED
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:cleanup_node on proc [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE JOB [6128,14] STATE NORMALLY TERMINATED AT base/state_base_fns.c:775
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING JOB [6128,14] STATE NORMALLY TERMINATED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:dvm:check_job_complete on job [6128,14]
[saturn.icl.utk.edu:81156] [[6128,0],0] state:dvm releasing procs from node saturn
[saturn.icl.utk.edu:81156] [[6128,0],0] state:dvm releasing proc [[6128,14],0] from node saturn
[saturn.icl.utk.edu:81156] [[6128,0],0] state:dvm releasing proc [[6128,14],2] from node saturn
[saturn.icl.utk.edu:81156] [[6128,0],0] state:dvm:check_job_completed state is terminated - activating notify
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE JOB [6128,14] STATE NOTIFY COMPLETED AT state_dvm.c:588
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING JOB [6128,14] STATE NOTIFY COMPLETED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp:read handler [[6128,14],2] Error on connection:45
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],2] of size 0 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 0 bytes from stdout of [[6128,14],2]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],2] of size 0 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 0 bytes from stderr of [[6128,14],2]
prte: base/iof_base_frame.c:195: orte_iof_base_proc_destruct: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (ptr->revstderr))->obj_magic_id' failed.
PROC [[6128,14],1] and [[6128,14],3] NORMALLY TERMINATED;
PROC [[6128,14],2] and [[6128,14],0] are still doing IO forwarding when the job starts to terminate.
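To make the suspected ordering concrete, here is a schematic, with hypothetical names rather than the actual PRRTE structures, of the per-proc termination gate the trace implies:

#include <stdbool.h>

/* Hypothetical sketch, not the actual PRRTE structures: a proc's IOF
 * read handlers are only safe to release once BOTH events have fired. */
typedef struct {
    bool iof_complete;   /* EOF seen on the proc's stdout and stderr */
    bool waitpid_fired;  /* exit status collected via waitpid() */
} proc_term_t;

static bool safe_to_release_iof(const proc_term_t *p)
{
    return p->iof_complete && p->waitpid_fired;
}

In the trace above, the HNP begins finalizing while this predicate is still false for procs 0 and 2, so their read handlers are never freed and the later destruct trips the obj_magic_id assertion.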
Thank you for taking the time to submit an issue!
compilation fails
git master @ b7fdd9b
external
github master @ 6c18b47a34621972bb4ab9cfd19a27f1f3587e97
git clone
CC orted/pmix/pmix_server.lo
CC orted/pmix/pmix_server_fence.lo
CC orted/pmix/pmix_server_register_fns.lo
CC orted/pmix/pmix_server_dyn.lo
CC orted/pmix/pmix_server_pub.lo
CC orted/pmix/pmix_server_gen.lo
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c: In function 'pmix_server_notify':
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:418:18: warning: assignment from incompatible pointer type [enabled by default]
cd->info = OBJ_NEW(opal_list_t);
^
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:420:5: error: 'val' undeclared (first use in this function)
val = OBJ_NEW(opal_value_t);
^
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:420:5: note: each undeclared identifier is reported only once for each function it appears in
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:424:5: warning: passing argument 1 of '_opal_list_append' from incompatible pointer type [enabled by default]
opal_list_append(cd->info, &val->super);
^
In file included from ../../psrvr-git/opal/dss/dss_types.h:42:0,
from ../../psrvr-git/opal/dss/dss.h:32,
from ../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:37:
../../psrvr-git/opal/class/opal_list.h:544:20: note: expected 'struct opal_list_t *' but argument is of type 'struct pmix_info_t *'
static inline void _opal_list_append(opal_list_t *list, opal_list_item_t *item
^
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c: In function 'pmix_server_notify_event':
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:461:23: error: 'val' undeclared (first use in this function)
OPAL_LIST_FOREACH(val, info, opal_value_t) {
^
../../psrvr-git/opal/class/opal_list.h:215:8: note: in definition of macro 'OPAL_LIST_FOREACH'
for (item = (type *) (list)->opal_list_sentinel.opal_list_next ; \
^
../../psrvr-git/opal/class/opal_list.h:215:30: error: 'pmix_info_t' has no member named 'opal_list_sentinel'
for (item = (type *) (list)->opal_list_sentinel.opal_list_next ; \
^
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:461:5: note: in expansion of macro 'OPAL_LIST_FOREACH'
OPAL_LIST_FOREACH(val, info, opal_value_t) {
^
../../psrvr-git/opal/class/opal_list.h:216:32: error: 'pmix_info_t' has no member named 'opal_list_sentinel'
item != (type *) &(list)->opal_list_sentinel ; \
^
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:461:5: note: in expansion of macro 'OPAL_LIST_FOREACH'
OPAL_LIST_FOREACH(val, info, opal_value_t) {
^
make[2]: *** [orted/pmix/pmix_server_gen.lo] Error 1
make[2]: Leaving directory `/gpfs/home/arcurtis/src/prrte/build/orte'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/gpfs/home/arcurtis/src/prrte/build/orte'
make: *** [all-recursive] Error 1
I've tracked down the instability problem to an issue in the IOF, specifically when forwarding IO to the prun tool. At some point we hit a situation where the write fails (reason unclear) and we then attempt to release the msg, only to segfault due to a bad pointer:
Execution 20
[Ralphs-iMac-2.local:18679] SENDONEWAY server/pmix_server.c:1677:2
[Ralphs-iMac-2.local:18679] pmix_ptl_base: send_msg: write failed: Invalid argument (22) [sd = 14]
prte(18679,0x70000c640000) malloc: *** error for object 0x7fc5ed00f348: pointer being freed was not allocated
prte(18679,0x70000c640000) malloc: *** set a breakpoint in malloc_error_break to debug
[Ralphs-iMac-2.local:19175] PMIX ERROR: UNPACK-PAST-END in file event/pmix_event_registration.c at line 99
The last error comes from prun because it received only a partial payload.
It is unclear to me why the msg object is getting trashed. Here is what I see from gdb of the resulting prte core file:
(gdb) thread apply all where
Thread 5 (core thread 4):
#0 0x00007fff7c6655aa in select$DARWIN_EXTSN ()
#1 0x00000001029432ab in listen_thread (obj=0x102951850) at oob_tcp_listener.c:705
#2 0x00007fff7c718305 in _pthread_body ()
#3 0x00007fff7c71b26f in _pthread_start ()
#4 0x00007fff7c717415 in thread_start ()
Thread 4 (core thread 3):
#0 0x00007fff7c664716 in kevent ()
#1 0x00000001005a70c6 in kq_dispatch (base=0x7fc5eaf0c380, tv=<value temporarily unavailable, due to optimizations>) at kqueue.c:302
#2 0x00007000000f421a in ?? ()
Thread 3 (core thread 2):
#0 0x00007fff7c6655aa in select$DARWIN_EXTSN ()
#1 0x00000001004f2713 in listen_thread (obj=0x0) at base/ptl_base_listener.c:214
#2 0x00007fff7c718305 in _pthread_body ()
#3 0x00007fff7c71b26f in _pthread_start ()
#4 0x00007fff7c717415 in thread_start ()
Thread 2 (core thread 1):
#0 0x00007fff7c66423e in __pthread_kill ()
#1 0x00007fff7c71ac1c in pthread_kill ()
#2 0x00007fff7c5cd268 in __abort ()
#3 0x00007fff7c5cd1d8 in abort ()
#4 0x00007fff7c6dc6e2 in malloc_vreport ()
#5 0x00007fff7c6dc4a3 in malloc_report ()
#6 0x00000001004efd63 in pmix_ptl_base_send_handler (sd=14, flags=4, cbdata=0x7fc5ed00f1c0) at base/ptl_base_sendrecv.c:438
#7 0x000000010059e0e8 in event_process_active_single_queue (base=0x7fc5eaf005b0, activeq=0x7fc5eaf00880, max_to_process=2147483647, endtime=0x0) at event.c:1580
Previous frame inner to this frame (gdb could not unwind past this frame)
Thread 1 (core thread 0):
#0 0x00007fff7c6617de in __psynch_cvwait ()
#1 0x00007fff7c71b593 in _pthread_cond_wait ()
#2 0x00000001002af994 in orte_state_base_track_procs (fd=-1, argc=4, cbdata=0x7fc5ed0333b0) at base/state_base_fns.c:732
#3 0x000000010059de75 in event_process_active_single_queue (base=0x7fc5eac14860, activeq=0x7fc5eac14bf0, max_to_process=2147483647, endtime=0x0) at event.c:1646
Previous frame inner to this frame (gdb could not unwind past this frame)
(gdb) thread 2
[Switching to thread 2 (core thread 1)]
0x00007fff7c66423e in __pthread_kill ()
(gdb) up
#1 0x00007fff7c71ac1c in pthread_kill ()
(gdb) up
#2 0x00007fff7c5cd268 in __abort ()
(gdb) up
#3 0x00007fff7c5cd1d8 in abort ()
(gdb) up
#4 0x00007fff7c6dc6e2 in malloc_vreport ()
(gdb) up
#5 0x00007fff7c6dc4a3 in malloc_report ()
(gdb) up
#6 0x00000001004efd63 in pmix_ptl_base_send_handler (sd=14, flags=4, cbdata=0x7fc5ed00f1c0) at base/ptl_base_sendrecv.c:438
438 PMIX_RELEASE(msg);
(gdb) print msg
$1 = (pmix_ptl_send_t *) 0x7fc5ed00f348
(gdb) print *msg
$2 = {
super = {
super = {
obj_magic_id = 0,
obj_class = 0x100526940,
obj_reference_count = 0,
cls_init_file_name = 0x1005071e7 "base/ptl_base_sendrecv.c",
cls_init_lineno = 438
},
pmix_list_next = 0x0,
pmix_list_prev = 0x0,
item_free = 1,
pmix_list_item_refcount = 0,
pmix_list_item_belong_to = 0x0
},
ev = {
ev_evcallback = {
evcb_active_next = {
tqe_next = 0xd8,
tqe_prev = 0x7fc5ed00f348
},
evcb_flags = 0,
evcb_pri = 0 '\0',
evcb_closure = 0 '\0',
evcb_cb_union = {
evcb_callback = 0x7fc500000000,
evcb_selfcb = 0x7fc500000000,
evcb_evfinalize = 0x7fc500000000,
evcb_cbfinalize = 0x7fc500000000
},
evcb_arg = 0x14000003e8
},
ev_timeout_pos = {
ev_next_with_common_timeout = {
tqe_next = 0xdeafbeeddeafbeed,
tqe_prev = 0x100526980
},
min_heap_idx = -558907667
},
ev_fd = 1,
ev_base = 0x1004f84ae,
ev_ = {
ev_io = {
ev_io_next = {
le_next = 0x7fc5000000ba,
le_prev = 0xdeafbeeddeafbeed
},
ev_timeout = {
tv_sec = 4300368192,
tv_usec = 1
}
},
ev_signal = {
ev_signal_next = {
le_next = 0x7fc5000000ba,
le_prev = 0xdeafbeeddeafbeed
},
ev_ncalls = 26944,
ev_pncalls = 0x7fc500000001
}
},
ev_events = 28067,
ev_res = 79,
ev_timeout = {
tv_sec = 140484085284953,
tv_usec = -318704672
}
},
hdr = {
pindex = -318704672,
tag = 32709,
nbytes = 4294967297
},
data = 0x7fc5ed00f3b8,
hdr_sent = false,
sdptr = 0xdeafbeeddeafbeed <Address 0xdeafbeeddeafbeed out of bounds>,
sdbytes = 4300368256
}
(gdb) print msg->data
$3 = (pmix_buffer_t *) 0x7fc5ed00f3b8
(gdb) print *msg->data
$4 = {
parent = {
obj_magic_id = 16046253926196952813,
obj_class = 0x100526980,
obj_reference_count = 1,
cls_init_file_name = 0x1004f84ae "include/pmix_globals.c",
cls_init_lineno = 186
},
type = 237 '?',
base_ptr = 0x100526940 "?mO",
pack_ptr = 0x7fc500000001 <Address 0x7fc500000001 out of bounds>,
unpack_ptr = 0x1004f6da3 "class/pmix_list.c",
bytes_allocated = 140484085284953,
bytes_used = 140488061547488
}
Can you take a look when you return? I'm guessing that we were only able to do a partial send, and that messed up the pointer to msg such that the subsequent attempt to complete the send errors out. Since the msg pointer has been messed up, we hit the "malloc free" error and abort.
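For reference, the sdptr/sdbytes fields in the dump suggest the usual resumable nonblocking-send bookkeeping. A simplified sketch, with hypothetical names rather than the actual ptl code, of how a partial send should leave the msg alive for a retry:

#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Simplified send-state record mirroring the sdptr/sdbytes fields seen
 * in the core dump; names are illustrative, not the actual ptl code. */
typedef struct {
    char  *sdptr;    /* next unsent byte */
    size_t sdbytes;  /* bytes remaining */
} send_state_t;

/* Returns 1 when fully sent (caller may release the msg exactly once),
 * 0 to retry later, -1 on a hard error (caller releases and closes). */
static int try_send(int sd, send_state_t *msg)
{
    while (msg->sdbytes > 0) {
        ssize_t n = send(sd, msg->sdptr, msg->sdbytes, 0);
        if (n < 0) {
            if (EAGAIN == errno || EWOULDBLOCK == errno) {
                return 0;   /* partial send: keep msg alive, retry */
            }
            return -1;      /* hard error */
        }
        msg->sdptr += n;    /* advance past what the kernel accepted */
        msg->sdbytes -= (size_t)n;
    }
    return 1;               /* complete: safe to release msg now */
}

If the release instead happens on the partial-send path, a second firing of the send handler would operate on freed memory, which matches the "pointer being freed was not allocated" abort above.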
Currently, the only way for an application process to connect to a PMIx server is by being spawned by that server - only tools have the logic to "discover" a PMIx server. This raises the question for PRRTE: how do we support singleton operations?
One possibility is to modify the client code to match that of a tool - i.e., if not given contact info, then search for it. However, this raises some security issues that we deal with for tools, but not necessarily for apps. It also begins to blur the distinction between the two categories.
Another option would be to have the singleton spin off its own "prun" to support it, as ORTE did - but that always left a sour taste in my mouth.
Any thoughts? My personal leaning would be to allow singletons to self-discover the local server, but to identify themselves as an app instead of a tool to make their intent clear for future places where we might want to differentiate them. For example, we allow a tool to drop a rendezvous file for subsequent attachment, but we don't provide that ability to an app.
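To make the first option concrete, a minimal sketch of the self-discovery fallback using only the public PMIx client/tool init calls. This illustrates the proposal, not existing behavior, and the app-vs-tool identification attribute is omitted since it does not exist yet:

#include <stdio.h>
#include <pmix.h>
#include <pmix_tool.h>

int main(void)
{
    pmix_proc_t me;
    int as_tool = 0;

    /* Normal client init: succeeds when a server spawned us and put its
     * contact info in our environment. */
    if (PMIX_SUCCESS != PMIx_Init(&me, NULL, 0)) {
        /* Fall back to tool-style rendezvous: search for a local server.
         * Per the proposal, an attribute identifying us as an app (not a
         * tool) would be passed here. */
        if (PMIX_SUCCESS != PMIx_tool_init(&me, NULL, 0)) {
            fprintf(stderr, "no local PMIx server found\n");
            return 1;
        }
        as_tool = 1;
    }

    /* ... singleton application runs here ... */

    return as_tool ? PMIx_tool_finalize() : PMIx_Finalize(NULL, 0);
}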
Thank you for taking the time to submit an issue!
Launch of PMIx clients fails/times out with latest PRRTE
PRRTE: master @ d54aa74
PMIx: 3.1.4 release and master @ 80f80b17589232eee49f6807afda2b853aee51d2
Programs launched on > 4 nodes (4 may just be the point at which I'm seeing the issue; it has no specific inherent meaning) under PBS/Torque time out with:
ORTE has lost communication with a remote daemon.
HNP daemon : [[12106,0],0] on node cn099
Remote daemon: [[12106,0],1] on node cn002
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
The same programs launch immediately using Open MPI 4.0.x as the PMIx server.
Thank you for taking the time to submit an issue!
We need to test the DVM features provided by PRRTE through the prte and prun commands (basically testing the run-a-job-in-a-job model). These tests need to be executed on OLCF platforms at ORNL.
PRRTE: master
PMIx: master
The goal so far has been to provide all the mechanisms needed to test different use cases of the PRRTE DVM on OLCF systems. This implies a few constraints:
The goal is to test different use cases relying on the distributed virtual machine capabilities of PRRTE. These use cases are driven by application teams' feedback, at the moment mainly from ORNL and the RADICAL team (http://radical.rutgers.edu).
The goal of this use case is to test the scalability when using as many nodes as possible on a platform, while using all the computing resources (cores at the moment) on compute nodes. The test shall fail if all the nodes on the platform (or at least a target number of nodes) cannot be used to run a simple /bin/hostname on each node of the allocation.
The goal of this use case is to test the scalability of the DVM when using as many nodes as possible with oversubscription and short-lived applications. The workload will be predefined (many-task model) and the test will discover the upper limit on the number of nodes it can run with. The test will succeed if the upper limit is equal to or greater than the target number of nodes for a given platform. The idea behind this test is also to assume that users can submit a large number of sub-jobs and the DVM will throttle sub-job execution to sustain high throughput (we do not have any quantitative requirements regarding throughput at the moment).
The goal of this use case is to test the scalability of DVM when using as many nodes as possible with no oversubscription and applications that run for a random amount of time. The number of tasks will be predefined but the total execution time required by the workload will be defined at runtime. The goal of this test is to evaluate the robustness of the infrastructure when running different types of applications.
Because of the environment at our center, integration with job/resource managers is mandatory (we cannot rely only on tests that require interactive sessions). This implies the need for an architecture where various resource/job managers can easily be added.
A simple proof-of-concept has been developed and used for evaluation on OLCF systems. The current version has been developed in Perl, the only programming language available on all target platforms when the project started. The programming language choice could be reconsidered at this time.
Development is based on an incremental approach, meaning that only the first supported use case is currently implemented. Testing currently focuses on the Summitdev system at ORNL; once we are able to pass our first test on the entire system, that test will be executed at larger scale on Summit, while other use cases are implemented and tested on Summitdev.
@ggouaillardet Can you take a look at this? Travis no longer seems to run, but I don't see an obvious reason for it.
Thank you for taking the time to submit an issue!
Compilation error on FreeBSD 12 in ptrace() call
PRRTE: github master @ d64505a
PMIx: github master @ a3cfa97da6983a33411e367f6a250964cff1dc55
Making all in mca/odls/default
CC odls_default_module.lo
odls_default_module.c: In function 'do_parent':
odls_default_module.c:473:20: error: 'PTRACE_DETACH' undeclared (first use in this function); did you mean 'PRRTE_DETACH'?
473 | ptrace(PTRACE_DETACH, cd->child->pid, 0, (void*)SIGSTOP);
| ^~~~~~~~~~~~~
| PRRTE_DETACH
odls_default_module.c:473:20: note: each undeclared identifier is reported only once for each function it appears in
odls_default_module.c:473:54: warning: passing argument 4 of 'ptrace' makes integer from pointer without a cast [-Wint-conversion]
473 | ptrace(PTRACE_DETACH, cd->child->pid, 0, (void*)SIGSTOP);
| ^
| |
| void *
In file included from odls_default_module.c:113:
/usr/include/sys/ptrace.h:220:57: note: expected 'int' but argument is of type 'void *'
220 | int ptrace(int _request, pid_t _pid, caddr_t _addr, int _data);
| ~~~~^~~~~
*** Error code 1
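As the compiler notes hint, FreeBSD spells the request PT_DETACH and declares the data argument as an int, so a portability guard along these lines would be needed. A sketch only, not the actual PRRTE fix:

#include <signal.h>
#include <sys/types.h>
#include <sys/ptrace.h>

/* Sketch of a portability guard: detach from a traced child while
 * delivering SIGSTOP. Illustrative only, not the actual PRRTE patch. */
static void detach_stopped(pid_t pid)
{
#if defined(__FreeBSD__)
    /* FreeBSD: request is PT_DETACH, addr is a caddr_t ((caddr_t)1 means
     * "continue where it left off"), data is the signal as an int. */
    ptrace(PT_DETACH, pid, (caddr_t)1, SIGSTOP);
#else
    /* Linux: data is a void* carrying the signal number. */
    ptrace(PTRACE_DETACH, pid, 0, (void *)SIGSTOP);
#endif
}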
Thank you for taking the time to submit an issue!
PRRTE: git master @ ffe3dd3
PMIx: external, git master @ 3f81378fc76c12c6564c2fce2c69608a286a1707
Install method: git clone (with external libevent, PMIx, enable-debug)
Allocate 2 nodes:
salloc -k -N 2
Start the DVM using:
$prte -pmca pmix ext4x -pmca pmix_server_base_verbose 10 -debug-daemons
Run the example log.c under prrte/examples:
$prun -np 4 log --global-syslog
DVM ready
[phi.icl.utk.edu:19807] SWITCHYARD for 578093057:0:27
[phi.icl.utk.edu:19807] recvd pmix cmd JOB CONTROL from 578093057:0
[phi.icl.utk.edu:19807] recvd job control request from client
[phi.icl.utk.edu:19807] SWITCHYARD for 578093057:0:27
[phi.icl.utk.edu:19807] recvd pmix cmd REGISTER EVENT HANDLER from 578093057:0
[phi.icl.utk.edu:19807] server:regevents_cbfunc called status = 0
[phi.icl.utk.edu:19807] SWITCHYARD for 578093057:0:27
[phi.icl.utk.edu:19807] recvd pmix cmd SPAWN from 578093057:0
[phi.icl.utk.edu:19807] [[8821,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[phi.icl.utk.edu:19807] [[8821,0],0] orted_cmd: received add_local_procs
[phi.icl.utk.edu:19807] pmix:server _register_nspace 578093058
[helium.phi:20416] [[8821,0],1] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[helium.phi:20416] [[8821,0],1] orted_cmd: received add_local_procs
[lithium.phi:18531] [[8821,0],2] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[lithium.phi:18531] [[8821,0],2] orted_cmd: received add_local_procs
[helium.phi:20416] pmix:server register client 578093058:0
[helium.phi:20416] pmix:server register client 578093058:1
[helium.phi:20416] pmix:server register client 578093058:2
[helium.phi:20416] pmix:server register client 578093058:3
[helium.phi:20416] pmix:server _register_client for nspace 578093058 rank 0
[helium.phi:20416] pmix:server _register_client for nspace 578093058 rank 1
[helium.phi:20416] pmix:server _register_client for nspace 578093058 rank 2
[helium.phi:20416] pmix:server _register_client for nspace 578093058 rank 3
[lithium.phi:18531] pmix:server _register_nspace 578093058
[helium.phi:20416] pmix:server _register_nspace 578093058
[helium.phi:20416] pmix:server setup_fork for nspace 578093058 rank 0
[helium.phi:20416] pmix:server setup_fork for nspace 578093058 rank 1
[helium.phi:20416] pmix:server setup_fork for nspace 578093058 rank 2
[helium.phi:20416] pmix:server setup_fork for nspace 578093058 rank 3
[phi.icl.utk.edu:19807] SWITCHYARD for 578093057:0:27
[phi.icl.utk.edu:19807] recvd pmix cmd REGISTER EVENT HANDLER from 578093057:0
[phi.icl.utk.edu:19807] server:regevents_cbfunc called status = 0
[helium.phi:20416] SWITCHYARD for 578093058:0:22
[helium.phi:20416] recvd pmix cmd REQUEST INIT INFO from 578093058:0
[helium.phi:20416] SWITCHYARD for 578093058:1:23
[helium.phi:20416] recvd pmix cmd REQUEST INIT INFO from 578093058:1
[helium.phi:20416] SWITCHYARD for 578093058:0:22
[helium.phi:20416] recvd pmix cmd LOG from 578093058:0
[helium.phi:20416] recvd log from client
prted: orted/pmix/pmix_server_gen.c:1204: pmix_server_log_fn: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&bo))->obj_magic_id' failed.
srun: error: helium: task 0: Aborted (core dumped)
srun: Terminating job step 7723.0
[lithium.phi:18531] [[8821,0],2]:base/ess_base_std_orted.c(676) updating exit status to 1
(null): Forwarding signal 18 to job
[lithium.phi:18531] pmix:server finalize called
[lithium.phi:18531] pmix:server finalize complete
srun: error: lithium: task 1: Exited with exit code 1
[phi.icl.utk.edu:19807] pmix:server finalize called
This works fine on 1 node; it only happens when you have multiple nodes. For my test, I use 2 nodes.
Thank you for taking the time to submit an issue!
When compiling PRRTE, if I forget to specify the location where I installed the PMIx library, the configure script tells me to use the --with-external-pmix option, but the correct option is, as far as I can tell, --with-pmix, not --with-external-pmix.
3.0.0rc1
3.0.2
$ lsb_release -a
LSB Version: :core-4.1-noarch:core-4.1-ppc64le
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 7.5 (Maipo)
Release: 7.5
Codename: Maipo
When compiling PRRTE, if I forget to specify the location where I installed the PMIx library, I get the following error message:
============================================================================
== Configure PMIx
============================================================================
checking --with-external-pmix value... not found
configure: WARNING: Expected file /usr/include/pmix.h not found
configure: error: Cannot continue
However, the correct option is, as far as I can tell, --with-pmix and not --with-external-pmix. When using --with-pmix, everything is fine.
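For anyone hitting the same message, the working invocation per this report looks like the following (paths illustrative):

./configure --prefix=/path/to/prrte/install --with-pmix=/path/to/pmix/install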
Meant to post to pmix issues
Thank you for taking the time to submit an issue!
PRRTE: master @ 2a0539a
PMIx: 3.1.4
Compile error in the ras:lsf component when building on Summit.
make[2]: Entering directory `/autofs/nccs-svm1_sw/summit/ums/ompix/DEVELOP/gcc/6.4.0/build/prrte-br-master/orte/mca/ras/lsf'
CC ras_lsf_module.lo
In file included from ../../../../../../../../source/prrte-br-master/orte/mca/ras/lsf/ras_lsf_module.c:37:0:
../../../../../../../../source/prrte-br-master/orte/mca/ras/lsf/ras_lsf_module.c: In function 'allocate':
../../../../../../../../source/prrte-br-master/orte/mca/ras/lsf/ras_lsf_module.c:122:65: error: 'orte_rmaps_base' undeclared (first use in this function)
} else if ((ORTE_MAPPING_GIVEN & ORTE_GET_MAPPING_DIRECTIVE(orte_rmaps_base.mapping)) ||
^
../../../../../../../../source/prrte-br-master/orte/mca/rmaps/rmaps_types.h:103:7: note: in definition of macro 'ORTE_GET_MAPPING_DIRECTIVE'
((pol) & 0xff00)
^~~
../../../../../../../../source/prrte-br-master/orte/mca/ras/lsf/ras_lsf_module.c:122:65: note: each undeclared identifier is reported only once for each function it appears in
} else if ((ORTE_MAPPING_GIVEN & ORTE_GET_MAPPING_DIRECTIVE(orte_rmaps_base.mapping)) ||
^
../../../../../../../../source/prrte-br-master/orte/mca/rmaps/rmaps_types.h:103:7: note: in definition of macro 'ORTE_GET_MAPPING_DIRECTIVE'
((pol) & 0xff00)
^~~
../../../../../../../../source/prrte-br-master/orte/mca/ras/lsf/ras_lsf_module.c:123:70: error: expected ')' before '{' token
OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy) {
^
../../../../../../../../source/prrte-br-master/orte/mca/ras/lsf/ras_lsf_module.c:174:1: error: expected expression before '}' token
}
^
make[2]: *** [ras_lsf_module.lo] Error 1
make[2]: Leaving directory `/autofs/nccs-svm1_sw/summit/ums/ompix/DEVELOP/gcc/6.4.0/build/prrte-br-master/orte/mca/ras/lsf'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/autofs/nccs-svm1_sw/summit/ums/ompix/DEVELOP/gcc/6.4.0/build/prrte-br-master/orte'
make: *** [all-recursive] Error 1
Thank you for taking the time to submit an issue!
prte crashes on startup
PRRTE git master @ 891a7dd
PMIx git master @ 257f6b4c9aced263824a4273996678985bea5d0d
but happening on other machines/platforms too
shell$ prte
[login2:57281] *** Process received signal ***
[login2:57281] Signal: Segmentation fault (11)
[login2:57281] Signal code: Address not mapped (1)
[login2:57281] Failing at address: 0x30
[login2:57281] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x2aaaacd256d0]
[login2:57281] [ 1] /gpfs/projects/ChapmanGroup/opt/pmix/git/lib/pmix/mca_pnet_tcp.so(+0x2403)[0x2aaab1d98403]
[login2:57281] [ 2] /gpfs/projects/ChapmanGroup/opt/pmix/git/lib/libpmix.so.0(pmix_pnet_base_select+0xd0)[0x2aaaab33d8e0]
[login2:57281] [ 3] /gpfs/projects/ChapmanGroup/opt/pmix/git/lib/libpmix.so.0(PMIx_server_init+0x741)[0x2aaaab2c2151]
[login2:57281] [ 4] /gpfs/projects/ChapmanGroup/opt/prrte/git/lib/libprteopen-rte.so.0(pmix_server_init+0x8ba)[0x2aaaaad1ef3a]
[login2:57281] [ 5] /gpfs/projects/ChapmanGroup/opt/prrte/git/lib/pmix/mca_ess_hnp.so(+0x4700)[0x2aaaae142700]
[login2:57281] [ 6] /gpfs/projects/ChapmanGroup/opt/prrte/git/lib/libprteopen-rte.so.0(orte_init+0x2c6)[0x2aaaaace51a6]
[login2:57281] [ 7] prte[0x4024ee]
[login2:57281] [ 8] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaacf54445]
[login2:57281] [ 9] prte[0x401d19]
[login2:57281] *** End of error message ***
Segmentation fault (core dumped)
gdb says:
shell$ gdb `which prte`
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /gpfs/projects/ChapmanGroup/opt/prrte/git/bin/prte...done.
(gdb) r
Starting program: /gpfs/projects/ChapmanGroup/opt/prrte/git/bin/prte
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: File "/gpfs/projects/ChapmanGroup/opt/gcc/git/lib64/libstdc++.so.6.0.26-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
To enable execution of this file add
add-auto-load-safe-path /gpfs/projects/ChapmanGroup/opt/gcc/git/lib64/libstdc++.so.6.0.26-gdb.py
line to your configuration file "/gpfs/home/arcurtis/.gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/gpfs/home/arcurtis/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
[New Thread 0x2aaaae13d700 (LWP 66951)]
Program received signal SIGSEGV, Segmentation fault.
0x00002aaab1d98403 in tcp_finalize ()
from /gpfs/projects/ChapmanGroup/opt/pmix/git/lib/pmix/mca_pnet_tcp.so
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.170-4.el7.x86_64 elfutils-libs-0.170-4.el7.x86_64 glibc-2.17-222.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-9.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 openssl-libs-1.0.2k-12.el7.x86_64 systemd-libs-219-57.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0 0x00002aaab1d98403 in tcp_finalize ()
from /gpfs/projects/ChapmanGroup/opt/pmix/git/lib/pmix/mca_pnet_tcp.so
#1 0x00002aaaab33d8e0 in pmix_pnet_base_select ()
from /gpfs/projects/ChapmanGroup/opt/pmix/git/lib/libpmix.so.0
#2 0x00002aaaab2c2151 in PMIx_server_init ()
from /gpfs/projects/ChapmanGroup/opt/pmix/git/lib/libpmix.so.0
#3 0x00002aaaaad1ef3a in pmix_server_init ()
from /gpfs/projects/ChapmanGroup/opt/prrte/git/lib/libprteopen-rte.so.0
#4 0x00002aaaae142700 in rte_init ()
from /gpfs/projects/ChapmanGroup/opt/prrte/git/lib/pmix/mca_ess_hnp.so
#5 0x00002aaaaace51a6 in orte_init ()
from /gpfs/projects/ChapmanGroup/opt/prrte/git/lib/libprteopen-rte.so.0
#6 0x00000000004024ee in main (argc=1, argv=0x7fffffffaf48)
at ../../../../prrte-git/orte/tools/prte/prte.c:369
Thank you for taking the time to submit an issue!
prte never says "DVM ready" on compute nodes in a SLURM cluster
PRRTE: git master @ d31f0db
PMIx: git master @ 7962c62d4eeaadfe8411df2e058c8b909fbf529d (and 3.1.4 release)
prte launched on a SLURM compute node never says "DVM ready".
On a login/bare node, I get "DVM ready" immediately.
Let me know what debugging info to provide.
Thank you for taking the time to submit an issue!
PRRTE: git master @ 4c77d72
PMIx: git master @ c82c6dca63036d06e75da3aff8df16165635a56c
Testing an OpenSHMEM program that reads a value from stdin on start-up (rank 0 only).
With Open MPI 3.1.3 acting as launcher/server with PMIx 2.1.1, all is well.
With the PRRTE/PMIx git combination, I get the prompt and enter a value, but then there is no progress. It also takes significantly longer for the prompt to appear.
Simple test program:
#include <stdio.h>
#include <shmem.h>

int
main(void)
{
    int n = -1;
    int me;

    shmem_init();
    me = shmem_my_pe();
    if (me == 0) {
        printf("Enter n : "); fflush(stdout);
        fscanf(stdin, "%d", &n);
        printf("You entered %d\n", n);
    }
    shmem_barrier_all();
    printf("PE %d: n = %d\n", me, n);
    return 0;
}
Expected result:
$ oshrun -n 2 ./a.out
oshrun:prrte: found "prun"
oshrun:prrte: check matching "prte"
oshrun:prrte: no "prte", skipping
oshrun:prrte: check matching "psrvr"
oshrun:prrte: no "psrvr", skipping
oshrun:launch: look for "mpiexec"
oshrun:launch: using "/gpfs/projects/ChapmanGroup/opt/openmpi/3.1.3/bin/mpiexec"
oshrun:launch: "mpiexec -n 2 ./a.out"
oshrun:----------------------------------------------------------------------
Enter n : 23
You entered 23
PE 0: n = 23
PE 1: n = -1
oshrun:launch: done
Bad result:
$ oshrun -n 2 ./a.out
oshrun:prrte: found "prun"
oshrun:prrte: check matching "prte"
oshrun:prrte: found "prte"
oshrun:prrte: starting up
oshrun:prrte: pid 31639 says "DVM ready"
oshrun:launch: "prun -n 2 ./a.out"
oshrun:launch: application in process 31741
oshrun:----------------------------------------------------------------------
Enter n : 23
<nothing more, hang>
Playing a bit with PRRTE, I ran into what I suspect might be a bug in prte. I compiled master with external PMIx 3.0.2. I then ran the simple test on my laptop: I started prte -d in one terminal, and used prun to run a client:
prun -np 1 pmix-3.0.2/examples/client
Client ns 2573926402 rank 0: Running
Client 2573926402:0 universe size 4
Client 2573926402:0 num procs 1
Client ns 2573926402 rank 0: Finalizing
Client ns 2573926402 rank 0:PMIx_Finalize successfully completed
So far so good. But then I Ctrl-C'd the client and didn't let it finish cleanly. After that, when I tried to prun again, nothing happened. That is, prun doesn't show any output. I still do get debug messages on the prte console, so there is some activity going on between prun and prte, but the client is not executed.
I've noticed that instead of aborting the client with a signal, I can break prte in the same way by running the alloc example:
$ ~/work/pmi/install-prrte/bin/prun -np 1 ./alloc
Client ns 2554920962 rank 0: Running
Client 2554920962:0 universe size 4
Allocation request returned PROC-ABORT-REQUESTED
After that, prun doesn't do anything anymore. So it seems that some behavior in the client can put the server into an unusable state.
Thank you for taking the time to submit an issue!
After a recent update from a git pull, I started getting the above error message when launching programs.
N.B. it continues to work if PMIx is configured with --enable-debug.
PRRTE: git master @ e886f1d
PMIx: external, git master @ f894bfce36d11913e81f05b54da0f1fead8c3701
Install method: git clone
When running through prun I get:
arcurtis@cn-mem[1](~/shmem/novo-test) prte -pmca pmix_server_verbose 99 -pmca orte_data_server_verbose 99 -pmca orte_report_silent_errors 1 -pmca odls_base_verbose 99 &
[1] 34302
arcurtis@cn-mem[1](~/shmem/novo-test) [cn-mem:34302] mca: base: components_register: registering framework odls components
[cn-mem:34302] mca: base: components_register: found loaded component default
[cn-mem:34302] mca: base: components_register: component default has no register or open function
[cn-mem:34302] mca: base: components_open: opening odls components
[cn-mem:34302] mca: base: components_open: found loaded component default
[cn-mem:34302] mca: base: components_open: component default open function successful
[cn-mem:34302] mca:base:select: Auto-selecting odls components
[cn-mem:34302] mca:base:select:( odls) Querying component [default]
[cn-mem:34302] mca:base:select:( odls) Query of component [default] set priority to 10
[cn-mem:34302] mca:base:select:( odls) Selected component [default]
DVM ready
arcurtis@cn-mem[1](~/shmem/novo-test) prun -v -n 1 ./a.out
[cn-mem:34302] [[38001,0],0] TOOL CONNECTION REQUEST RECVD
[cn-mem:34302] [[38001,0],0] TOOL CONNECTION PROCESSING
[cn-mem:34302] [[38001,0],0] TOOL CONNECTION FROM UID 170008941 GID 170008941
[cn-mem:34302] [[38001,0],0] spawn called from proc [[38001,1],0]
[cn-mem:34302] *** Process received signal ***
[cn-mem:34302] Signal: Segmentation fault (11)
[cn-mem:34302] Signal code: Address not mapped (1)
[cn-mem:34302] Failing at address: 0x30
[cn-mem:34302] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2aaaace03370]
[cn-mem:34302] [ 1] /gpfs/home/arcurtis/opt/pmix/git/lib/pmix/mca_pnet_tcp.so(+0x24fc)[0x2aaab13334fc]
[cn-mem:34302] [ 2] /gpfs/home/arcurtis/opt/pmix/git/lib/pmix/mca_pnet_tcp.so(+0x6c3e)[0x2aaab1337c3e]
[cn-mem:34302] [ 3] /gpfs/home/arcurtis/opt/pmix/git/lib/libpmix.so.0(pmix_pnet_base_allocate+0x190)[0x2aaaab4b4550]
[cn-mem:34302] [ 4] /gpfs/home/arcurtis/opt/pmix/git/lib/libpmix.so.0(+0x56c26)[0x2aaaab462c26]
[cn-mem:34302] [ 5] /gpfs/projects/ChapmanGroup/opt/libevent/lib/libevent-2.1.so.6(+0x2153d)[0x2aaaab90b53d]
[cn-mem:34302] [ 6] /gpfs/projects/ChapmanGroup/opt/libevent/lib/libevent-2.1.so.6(event_base_loop+0x3ef)[0x2aaaab90bc4f]
[cn-mem:34302] [ 7] /gpfs/home/arcurtis/opt/pmix/git/lib/libpmix.so.0(+0x7348e)[0x2aaaab47f48e]
[cn-mem:34302] [ 8] /lib64/libpthread.so.0(+0x7dc5)[0x2aaaacdfbdc5]
[cn-mem:34302] [ 9] /lib64/libc.so.6(clone+0x6d)[0x2aaaad10776d]
[cn-mem:34302] *** End of error message ***
[cn-mem:34324] Job failed to spawn: UNREACHABLE
[1]+ Segmentation fault (core dumped) prte -pmca pmix_server_verbose 99 -pmca orte_data_server_verbose 99 -pmca orte_report_silent_errors 1 -pmca odls_base_verbose 99
GDB of prte:
(gdb) r
Starting program: /gpfs/home/arcurtis/opt/prrte/git/bin/prte
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x2aaaae010700 (LWP 4311)]
[New Thread 0x2aaab1562700 (LWP 4312)]
[New Thread 0x2aaab2185700 (LWP 4313)]
[New Thread 0x2aaab2386700 (LWP 4314)]
DVM ready
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2aaaae010700 (LWP 4311)]
0x00002aaab115a4cc in pmix_obj_run_destructors (
object=0x2aaab13613f0 <available+16>)
at /gpfs/home/arcurtis/src/pmix/pmix-git/src/class/pmix_object.h:452
452 cls_destruct = object->obj_class->cls_destruct_array;
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.163-3.el7.x86_64 elfutils-libs-0.163-3.el7.x86_64 glibc-2.17-157.el7_3.5.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-10.el7.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 libstdc++-4.8.5-4.el7.x86_64 libxml2-2.9.1-6.el7_2.2.x86_64 openssl-libs-1.0.1e-51.el7_2.4.x86_64 pcre-8.32-15.el7.x86_64 systemd-libs-219-19.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) where
#0 0x00002aaab115a4cc in pmix_obj_run_destructors (
object=0x2aaab13613f0 <available+16>)
at /gpfs/home/arcurtis/src/pmix/pmix-git/src/class/pmix_object.h:452
#1 ttdes (p=0x2aaab4034bc0)
at ../../../../../pmix-git/src/mca/pnet/tcp/pnet_tcp.c:181
#2 0x00002aaab115ec0e in pmix_obj_run_destructors (object=0x2aaab4034bc0)
at /gpfs/home/arcurtis/src/pmix/pmix-git/src/class/pmix_object.h:454
#3 allocate (nptr=0x2aaab4033f80, info=<optimized out>, ilist=0x2aaaae00fd70)
at ../../../../../pmix-git/src/mca/pnet/tcp/pnet_tcp.c:620
#4 0x00002aaaab4b7290 in pmix_pnet_base_allocate (nspace=<optimized out>,
info=<optimized out>, ninfo=<optimized out>, ilist=<optimized out>)
at ../../../../pmix-git/src/mca/pnet/base/pnet_base_fns.c:121
#5 0x00002aaaab464d96 in _setup_app (sd=<optimized out>,
args=<optimized out>, cbdata=0x822a00)
at ../../pmix-git/src/server/pmix_server.c:1461
#6 0x00002aaaab90e53d in event_process_active_single_queue (
base=base@entry=0x709950, activeq=0x709da0,
max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0)
at event.c:1646
#7 0x00002aaaab90ec4f in event_process_active (base=0x709950) at event.c:1738
#8 event_base_loop (base=0x709950, flags=flags@entry=1) at event.c:1961
#9 0x00002aaaab4815fe in progress_engine (obj=<optimized out>)
at ../../pmix-git/src/runtime/pmix_progress_threads.c:109
#10 0x00002aaaacbfddc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00002aaaacf0976d in clone () from /lib64/libc.so.6
(PMIx and PRRTE both using same version of hwloc.)
PRRTE: github master @ 716be58 (so that it compiles with PMIx 3.1.2)
PMIx: 3.1.2 (w/ external hwloc 2.0.3)
CentOS Linux release 7.5.1804 x86_64
Xeon E5-2690 v3
Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
Hello, we are the maintainers of the OpenSHMEM implementation OSSS-UCX, which uses PMIx to exchange UCX parameters during its start-up.
Details: https://github.com/openshmem-org/osss-ucx/blob/master/src/shmemc/ucx/pmix_client.c
Initially we used PMIx_Publish and PMIx_Lookup to do this, but the approach scales poorly on several HPC clusters we have tested. For a simple hello-world program that does nothing other than calling shmem_init() and shmem_finalize(), it takes OSSS-UCX about 120 seconds on 192 PEs. Below is a trimmed output from the Linux kernel's perf profiler.
|--90.89%--orte_rml_base_process_msg
| |
| --90.44%--orte_data_server
| |
| |--82.08%--orte_util_print_name_args
| | |
| | |--32.02%--__snprintf
| | | |
| | | --31.64%--_IO_vsnprintf
| | | |
| | | |--27.75%--vfprintf
| | | |
| | | |--1.62%--_IO_str_init_static_internal
| | | |
| | | --1.52%--_IO_no_init
| | |
| | |--30.60%--orte_util_print_jobids
| | | |
| | | |--29.22%--__snprintf
| | | | |
| | | | --28.81%--_IO_vsnprintf
| | | | |
| | | | |--25.23%--vfprintf
| | | | |
| | | | |--1.72%--_IO_str_init_static_internal
| | | | |
| | | | --1.03%--_IO_no_init
| | | |
| | | --0.60%--get_print_name_buffer
| | |
| | |--17.00%--orte_util_print_vpids
| | | |
| | | |--15.09%--__snprintf
| | | | |
| | | | --14.56%--_IO_vsnprintf
| | | | |
| | | | |--10.41%--vfprintf
| | | | |
| | | | |--1.77%--_IO_no_init
| | | | |
| | | | --1.52%--_IO_str_init_static_internal
| | | |
| | | --0.98%--get_print_name_buffer
| | | |
| | | --0.76%--pthread_getspecific
| | |
| | --1.25%--get_print_name_buffer
| | |
| | --0.80%--pthread_getspecific
| |
| |--3.78%--__strncmp_sse42
| |
| |--0.80%--pthread_mutex_unlock
| |
| --0.67%--pthread_mutex_lock
Apparently, the function orte_data_server was called many, many times and 90% of the total run time was spent in it. Looking closer, the function orte_util_print_name_args (ORTE_NAME_PRINT) is the most expensive part, as it always formats the log strings even if no log gets printed. I forked prrte and removed all the lines in orte_data_server that contain ORTE_NAME_PRINT, and this reduced the total run time of the hello-world program to around 20 seconds (orte_data_server is still being called many, many times).
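The underlying pattern is generic: a verbose-logging call evaluates its arguments even when the message is filtered out, so expensive formatting must sit behind the level check. A self-contained sketch with hypothetical helper names, not the actual OPAL/ORTE API:

#include <stdio.h>
#include <stdarg.h>

/* Hypothetical stand-ins for the verbose-output machinery. */
static int verbosity = 0;   /* e.g. set from an MCA verbosity parameter */

static void log_verbose(int level, const char *fmt, ...)
{
    va_list ap;
    if (level > verbosity) {
        return;   /* suppressed, but the arguments were already built */
    }
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
}

/* Expensive snprintf-based formatter, like orte_util_print_name_args(). */
static const char *name_to_string(int vpid)
{
    static char buf[64];
    snprintf(buf, sizeof(buf), "[[jobid,%d]]", vpid);
    return buf;
}

static void track_proc(int vpid, const char *state)
{
    /* Anti-pattern: name_to_string() is evaluated before log_verbose()
     * can check the level, so the snprintf cost is paid regardless.
     * This is what dominated the perf profile above. */
    log_verbose(10, "%s reached %s\n", name_to_string(vpid), state);

    /* Guarded: skip the formatting entirely unless it will be printed. */
    if (verbosity >= 10) {
        log_verbose(10, "%s reached %s\n", name_to_string(vpid), state);
    }
}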
In the development branch of OSSS-UCX we have switched to PMIx_Get/Put/Commit, and now it only takes about 10 seconds to run the hello-world program on 192 PEs, without needing to remove the string-formatting macro.
New version: https://bitbucket.org/wenblu/osss-ucx/src/master/src/shmemc/ucx/pmix_client.c
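For readers following along, the switch amounts to replacing the server-mediated Publish/Lookup directory with the key-value modex. A minimal sketch of the Put/Commit/Fence/Get pattern, with an illustrative key name and minimal error handling:

#include <string.h>
#include <pmix.h>

/* Illustrative key name, not the one OSSS-UCX actually uses. */
#define WORKER_KEY "osssucx.worker.addr"

/* Store this PE's UCX worker address blob and make it visible to peers. */
static pmix_status_t exchange_my_blob(char *blob, size_t len)
{
    pmix_value_t val;
    pmix_info_t info;
    bool collect = true;
    pmix_status_t rc;

    PMIX_VALUE_CONSTRUCT(&val);
    val.type = PMIX_BYTE_OBJECT;
    val.data.bo.bytes = blob;   /* the value is copied by PMIx_Put */
    val.data.bo.size = len;
    if (PMIX_SUCCESS != (rc = PMIx_Put(PMIX_GLOBAL, WORKER_KEY, &val))) {
        return rc;
    }
    if (PMIX_SUCCESS != (rc = PMIx_Commit())) {
        return rc;
    }
    /* One collective over the namespace replaces per-key Publish/Lookup. */
    PMIX_INFO_CONSTRUCT(&info);
    PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);
    return PMIx_Fence(NULL, 0, &info, 1);
}

/* Fetch the blob a peer PE stored under the same key. */
static pmix_status_t fetch_peer_blob(const pmix_proc_t *me, pmix_rank_t peer,
                                     pmix_value_t **val)
{
    pmix_proc_t proc;

    PMIX_PROC_CONSTRUCT(&proc);
    strncpy(proc.nspace, me->nspace, PMIX_MAX_NSLEN);
    proc.rank = peer;
    return PMIx_Get(&proc, WORKER_KEY, NULL, 0, val);
}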
Thank you for taking the time to submit an issue!
PMIx server: github master @ 315681d
PMIx client: github master @ ef2575f3ac21a3261da16d827fe2efd27b46151c
git-clone
At program start, I occasionally see these messages from the client code:
[cn090:05314] [[43133,0],1] ORTE_ERROR_LOG: Not found in file ../../prrte-git/orte/util/nidmap.c at line 761
[cn090:05314] [[43133,0],1] ORTE_ERROR_LOG: Not found in file ../../prrte-git/orte/orted/orted_comm.c at line 270
The program still continues to run fine, though.
shell$ echo $PRRTE_ROOT
/install/prrte-master-x-master-dbg
shell$ cd $PRRTE_ROOT/bin
shell$ ln -s prun mpirun
shell$ cd $HOME
shell$ mpirun --map-by ppr:2:node --prefix $PRRTE_ROOT ./hello
--------------------------------------------------------------------------
Both a prefix was supplied to and the absolute path to was
given:
Prefix: /install/prrte-master-x-master-dbg
Path: /install/prrte-master-x-master-dbg/bin
Only one should be specified to avoid potential version
confusion. Operation will continue, but the -prefix option will be
used. This is done to allow you to select a different prefix for
the backend computation nodes than used on the frontend for .
--------------------------------------------------------------------------
sh: /install/prrte-master-x-master-dbg/prte: No such file or directory
mpirun failed to initialize, likely due to no DVM being available
A couple of items here:
The [prun:double-prefix] help message is missing some string values.
This was found by my Open MPI MTT testing, which tends to rely on the prefix to help set the specific build for that run.
Thank you for taking the time to submit an issue!
PRRTE: git master @ 4301061
PMIx: external, git master @ 3f81378fc76c12c6564c2fce2c69608a286a1707
Install method: git clone
make[2]: Entering directory `/gpfs/home/arcurtis/src/prrte/build/orte/tools/prun'
depbase=`echo prun.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I../../../../prrte-git/orte/tools/prun -I../../../opal/include -I../../../../prrte-git -I../../.. -I../../../../prrte-git/opal/include -I../../../../prrte-git/orte/include -I../../../orte/include -I/gpfs/home/arcurtis/opt/pmix/git/include -I/gpfs/projects/ChapmanGroup/opt/libevent/include -I/gpfs/home/arcurtis/opt/pmix/git/include -I/gpfs/home/arcurtis/opt/hwloc/2.0.1/include -DNDEBUG -ggdb -fno-strict-aliasing -mcx16 -pthread -g -MT prun.o -MD -MP -MF $depbase.Tpo -c -o prun.o ../../../../prrte-git/orte/tools/prun/prun.c &&\
mv -f $depbase.Tpo $depbase.Po
In file included from /gpfs/home/arcurtis/opt/pmix/git/include/pmix_common.h:2281:0,
from /gpfs/home/arcurtis/opt/pmix/git/include/pmix.h:52,
from ../../../../prrte-git/opal/pmix/pmix-internal.h:32,
from ../../../../prrte-git/orte/tools/prun/prun.c:61:
../../../../prrte-git/orte/tools/prun/prun.c: In function 'prun':
../../../../prrte-git/orte/tools/prun/prun.c:656:34: error: 'PMIX_LAUNCHER_RENDEZVOUS_FILE' undeclared (first use in this function)
PMIX_INFO_LOAD(ds->info, PMIX_LAUNCHER_RENDEZVOUS_FILE, param, PMIX_STRING);
^
/gpfs/home/arcurtis/opt/pmix/git/include/pmix_extend.h:110:22: note: in definition of macro 'PMIX_INFO_LOAD'
if (NULL != (k)) { \
^
../../../../prrte-git/orte/tools/prun/prun.c:656:34: note: each undeclared identifier is reported only once for each function it appears in
PMIX_INFO_LOAD(ds->info, PMIX_LAUNCHER_RENDEZVOUS_FILE, param, PMIX_STRING);
^
/gpfs/home/arcurtis/opt/pmix/git/include/pmix_extend.h:110:22: note: in definition of macro 'PMIX_INFO_LOAD'
if (NULL != (k)) { \
^
make[2]: *** [prun.o] Error 1
make[2]: Leaving directory `/gpfs/home/arcurtis/src/prrte/build/orte/tools/prun'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/gpfs/home/arcurtis/src/prrte/build/orte'
make: *** [all-recursive] Error 1
Build failure with latest PRRTE master using pmix-3.1.3, due to PMIX_SERVER_SCHEDULER being missing in that PMIx version while the reference is unguarded in PRRTE. Not sure if this should be a configury check, or how PMIx-version-specific bits are handled in PRRTE.
../configure \
--prefix=$PKG_INSTALL_PREFIX \
--with-hwloc=$HWLOC_INSTALL_DIR \
--with-pmix=$PMIX_INSTALL_DIR \
--with-libevent=$LIBEVENT_INSTALL_DIR \
--enable-orterun-prefix-by-default \
&& make -j 4 \
&& make install
...<snip>...
CC orted/pmix/pmix_server_pub.lo
In file included from /usr/include/string.h:630:0,
from /home/3t4/projects/pmix/ssd-pmix/prrte/install/include/hwloc.h:59,
from ../../opal/hwloc/hwloc-internal.h:28,
from ../../opal/util/proc.h:22,
from ../../orte/include/orte/types.h:30,
from ../../orte/orted/pmix/pmix_server.c:30:
../../orte/orted/pmix/pmix_server.c: In function 'pmix_server_init':
../../orte/orted/pmix/pmix_server.c:382:26: error: 'PMIX_SERVER_SCHEDULER' undeclared (first use in this function)
kv->key = strdup(PMIX_SERVER_SCHEDULER);
^
../../orte/orted/pmix/pmix_server.c:382:26: note: each undeclared identifier is reported only once for each function it appears in
Makefile:1473: recipe for target 'orted/pmix/pmix_server.lo' failed
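Not knowing the intended fix, here is a minimal, self-contained sketch of the feature-test idea, assuming a compile-time guard is acceptable: PMIx attribute names such as PMIX_SERVER_SCHEDULER are preprocessor defines, so #ifdef works as a feature test. This is only an illustration, not PRRTE's actual fix.
#include <stdio.h>
#include <pmix_common.h>

int main(void)
{
#ifdef PMIX_SERVER_SCHEDULER
    /* newer PMIx: the attribute exists and could be advertised */
    printf("PMIx defines PMIX_SERVER_SCHEDULER = %s\n", PMIX_SERVER_SCHEDULER);
#else
    /* pmix-3.1.3 path: skip the attribute (or reject at configure time) */
    printf("PMIX_SERVER_SCHEDULER not available in this PMIx\n");
#endif
    return 0;
}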
I'm launching this in my MTT (in a SLURM allocation with 2 servers, each with
16 cores) -- note the use of --oversubscribe in here:
-----
mpirun --oversubscribe --bind-to none -np 32 --mca orte_startup_timeout 10000
--mca oob tcp --mca btl tcp,self --mca mpi_leave_pinned_pipeline 1
src/mpi2c++_dynamics_test
-----
And I'm getting this:
-----
MPI-2 C++ bindings MPI-2 dynamics test suite
------------------------------
Open MPI Version 2.0
*** There are delays built into some of the tests
*** Please let them complete
*** No test should take more than 10 seconds
Test suite running with 32 processes
* MPI-2 Dynamics...
- Looking for "connect" program... PASS
- MPI::Get_version... PASS
- MPI::Open_port... PASS
- MPI::Intercomm::Spawn...
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 1 slots
that were requested by the application:
src/connect
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[mpi015:28434] *** An
error occurred in MPI_Comm_spawn
[mpi015:28434] *** reported by process [549847041,0]
[mpi015:28434] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[mpi015:28434] *** MPI_ERR_SPAWN: could not spawn processes
[mpi015:28434] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[mpi015:28434] *** and potentially your MPI job)
-----
Yes, I'm running with -np 32 on 32 slots, but I said --oversubscribe. So why did it fail?
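For reference, a hedged sketch of the kind of spawn the dynamics test performs (the real test is src/mpi2c++_dynamics_test, which spawns the src/connect program seen in the output above; this standalone C version is only illustrative):
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    MPI_Init(&argc, &argv);
    /* requests 1 slot for the child; per this report it fails even
     * though --oversubscribe was passed to mpirun */
    MPI_Comm_spawn("src/connect", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}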
Thank you for taking the time to submit an issue!
Sorry if opening the ticket goes against the community rules; I am not quite sure I yet have enough information for a useful ticket.
I am trying to run scalability tests on various OLCF systems at ORNL to cover the needs of some of our users. My current test consists of starting N PEs on X nodes, where N is the number of cores available on a compute node times the number of nodes; basically, filling up compute nodes and trying to find the upper limit on the number of nodes before we start to face problems. At the moment, I am trying to find the value of X where I start to face problems. For every test, I run hostname, and to validate the test, I count the number of host names in the output. I acknowledge this might not be the best test, but it captures the needs of a user; I am willing to run other tests to capture scalability problems. I can also share my test.
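For the curious, the validation step is just counting distinct host names; a hedged one-liner (standard shell, mirroring the prun invocation in the script below):
shell$ prun --prefix $PRRTE_DIR -np 320 hostname | sort | uniq -c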
PRRTE master fd34cfa
PMIx master 30c51d72c74f0d225cd60aa8e4ce46054e44603d
$ lsb_release -a
LSB Version: :core-4.1-noarch:core-4.1-ppc64le
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 7.5 (Maipo)
Release: 7.5
Codename: Maipo
I am currently running my tests on Summitdev at ORNL.
My test runs the following loop starting with 32 nodes and 20 PEs per node (one PE per core):
On Summitdev, I get the following in a very consistent manner:
For the last run with 16 nodes, I get the following error (I do not track error messages for all runs at the moment):
[summitdev-login1:11547] PMIX ERROR: OUT-OF-RESOURCE in file /ccs/home/gvh/scratch/summitdev/prrte/pmix/src/src/server/pmix_server.c at line 1785
User defined signal 2
Based on this, I am suspecting a problem in the mapper since it should have all the required resources available.
The LSF script to start a job on 16 nodes looks like:
#!/bin/bash
# Begin LSF directives
#BSUB -P *****
#BSUB -J dvm_simple
#BSUB -o dvm_simple.out
#BSUB -e dvm_simple.err
#BSUB -W 00:10
#BSUB -nnodes 16
#BSUB -env "all"
# End LSF directives and begin shell commands
./get_list_hosts.pl
T="$(date +%s)"
echo "Starting DVM on 16 nodes..." >> ./dvm_simple_config.log
prte --prefix $PRRTE_DIR --report-uri prrteuri --hostfile ./DVM_HOSTS.txt &
echo "DVM started" >> ./dvm_simple_config.log
echo "Running job with 320 PEs..." >> ./dvm_simple_config.log
prun --prefix $PRRTE_DIR -np 320 hostname
echo "Job succeeded" >> ./dvm_simple_config.log
echo "Sleeping for 30 seconds to give a chance to all messages to come back from the nodes..."
sleep 30
echo "Terminating DVM..." >> ./dvm_simple_config.log
prun --prefix $PRRTE_DIR -terminate
echo "DVM teminated" >> ./dvm_simple_config.log
T="$(($(date +%s)-T))"
echo "Total job runtime: $T seconds" >> ./dvm_simple_config.log
Note that I included a sleep 30 to give the system a chance to propagate back all the I/O, since I believe there is no I/O flush in PRRTE at the moment.
This is using a PRRTE module that I generate for the system, PRRTE_DIR points at the install directory for PRRTE.
I will try to run the same test on Summit and Titan to see if I face the same limitations (these systems allow a different number of PEs per node).
Please let me know if you need any additional information, I will be happy to run any test to track this scalability problem.
Running prte_info segfaults on startup.
git master @ 52d4988
git master @ 88500d4
Linux desktop
shell$ prte_info
Segmentation fault
shell$ echo $?
139
(gdb) bt
#0 strlen () at ../sysdeps/x86_64/strlen.S:106
#1 0x00007ffff74d247e in __GI___strdup (s=0x0) at strdup.c:41
#2 0x00007ffff7ad0140 in prrte_mca_base_open ()
from /home/3t4/projects/ompi-ecp/ompi-scratch/CREEPY-CAT/ompi/_install/lib/libprrte.so.2
#3 0x0000000000402143 in main ()
(gdb)
Thank you for taking the time to submit an issue!
I use getenv() in my OpenSHMEM library, which launches through PMIx. Now, with PRRTE as the launcher, getenv() always returns NULL despite the environment variables being set.
git master @ 164ab7f
git master @ aa6fb1e3b2b5a340c427960b02d06c6ffa01bdc4
MWE below. When called through prte/prun, getenv() returns NULL despite VERBOSITY=1 being set in the environment. With Open MPI as the launcher, getenv() picks up the string for "VERBOSITY".
#include <stdio.h>
#include <stdlib.h>
#include <pmix.h>
int
main(void)
{
    pmix_proc_t p;
    /* connect back to the PMIx server set up by the launcher */
    PMIx_Init(&p, NULL, 0);
    /* expected: non-NULL when VERBOSITY is set in the launch environment */
    char *v = getenv("VERBOSITY");
    printf("%u: v = %p\n", p.rank, (void *)v);
    PMIx_Finalize(NULL, 0);
    return 0;
}
When a mapping fails, the return value from prun was set to success (0) instead of an abnormal termination value (non-zero).
This can be reproduced like this:
# Ask for more procs than you have slots
shell$ prun -np 2 -host localhost ./hello_world ; echo $?
Thank you for taking the time to submit an issue!
Using master branch of prrte (a187840)
Using master branch of PMIx (1ca482d)
Happening on multiple systems, including Cori (Cray system @ NERSC) and Cooley (Linux cluster @ ALCF).
I am trying to roll my own PMIx environment for testing on Cori and running into issues trying to launch jobs across multiple nodes (2 nodes for now). I am able to invoke prte across both nodes without error (I think; "DVM Ready" is printed and there are no visible errors).
However, the behavior I get when launching jobs is not as expected. Just trying to run something simple like hostname, to verify my processes are being distributed as I would like (round-robin across nodes), yields the following:
ssnyder@nid00098:~/software/ssg/build> ~/software/pmix/prrte/install/bin/prun -n 4 hostname
nid00098
nid00098
nid00098
nid00098
So, all 4 processes are being launched on a single node. Maybe that's expected behavior for prun in the absence of other command line arguments. It looks like there are multiple ways to request the distribution more explicitly, so I tried the following:
ssnyder@nid00098:~/software/ssg/build> ~/software/pmix/prrte/install/bin/prun --map-by node -n 4 hostname
nid00098
nid00098
prun: /global/homes/s/ssnyder/software/pmix/pmix/src/class/pmix_list.h:564: _pmix_list_append: Assertion `0 == item->pmix_list_item_refcount' failed.
[nid00098:47313] *** Process received signal ***
[nid00098:47313] Signal: Aborted (6)
[nid00098:47313] Signal code: (-6)
[nid00098:47313] [ 0] /lib64/libpthread.so.0(+0x12360)[0x2aaacc842360]
[nid00098:47313] [ 1] /lib64/libc.so.6(gsignal+0x110)[0x2aaacca84160]
[nid00098:47313] [ 2] /lib64/libc.so.6(abort+0x151)[0x2aaacca85741]
[nid00098:47313] [ 3] /lib64/libc.so.6(+0x2e75a)[0x2aaacca7c75a]
[nid00098:47313] [ 4] /lib64/libc.so.6(+0x2e7d2)[0x2aaacca7c7d2]
[nid00098:47313] [ 5] /global/homes/s/ssnyder/software/pmix/pmix/install/lib/pmix/mca_gds_hash.so(+0x2fa6)[0x2aaad08bffa6]
[nid00098:47313] [ 6] /global/homes/s/ssnyder/software/pmix/pmix/install/lib/pmix/mca_gds_hash.so(+0x614d)[0x2aaad08c314d]
[nid00098:47313] [ 7] /global/homes/s/ssnyder/software/pmix/pmix/install/lib/pmix/mca_gds_hash.so(+0x1203b)[0x2aaad08cf03b]
[nid00098:47313] [ 8] /global/homes/s/ssnyder/software/pmix/pmix/install/lib/libpmix.so.0(+0x68713)[0x2aaaab2e2713]
[nid00098:47313] [ 9] /global/homes/s/ssnyder/software/pmix/pmix/install/lib/libpmix.so.0(pmix_ptl_base_process_msg+0x35f)[0x2aaaab3b6001]
[nid00098:47313] [10] /global/u2/s/ssnyder/software/spack/opt/spack/cray-cnl7-haswell/gcc-8.3.0/libevent-2.1.8-a2ij5ml7twhl6oxmxtesm2fkjoafjaz5/lib/libevent-2.1.so.6[0x20023a15]
[nid00098:47313] [11] /global/u2/s/ssnyder/software/spack/opt/spack/cray-cnl7-haswell/gcc-8.3.0/libevent-2.1.8-a2ij5ml7twhl6oxmxtesm2fkjoafjaz5/lib/libevent-2.1.so.6(event_base_loop+0x51f)[0x200243ef]
[nid00098:47313] [12] /global/homes/s/ssnyder/software/pmix/pmix/install/lib/libpmix.so.0(+0xc2f3e)[0x2aaaab33cf3e]
[nid00098:47313] [13] /lib64/libpthread.so.0(+0x7569)[0x2aaacc837569]
[nid00098:47313] [14] /lib64/libc.so.6(clone+0x3f)[0x2aaaccb46a2f]
[nid00098:47313] *** End of error message ***
Aborted
So for some reason, prun really didn't like that. It invokes 2 processes on my first node (nid00098), but never does so on the other node (nid00099). I suspected maybe prte is just not running properly on the other node despite no errors when launching, so I checked:
ssnyder@nid00098:~/software/ssg/build> ps aux | grep prte
ssnyder 48322 0.4 0.0 1418960 17468 pts/0 Sl 12:42 0:00 /global/homes/s/ssnyder/software/pmix/prrte/install/bin/prte -prefix /global/homes/s/ssnyder/software/pmix/prrte/install/
ssnyder 48327 0.2 0.0 393060 18552 pts/0 Sl 12:42 0:00 srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=nid00099 --ntasks=1 prted -pmca ess "slurm" -pmca ess_base_jobid "1267859456" -pmca ess_base_vpid "1" -pmca ess_base_num_procs "2" -pmca orte_hnp_uri "1267859456.0;tcp://10.128.0.99:37339"
ssnyder 48337 0.0 0.0 188172 2300 pts/0 S 12:42 0:00 srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=nid00099 --ntasks=1 prted -pmca ess "slurm" -pmca ess_base_jobid "1267859456" -pmca ess_base_vpid "1" -pmca ess_base_num_procs "2" -pmca orte_hnp_uri "1267859456.0;tcp://10.128.0.99:37339"
ssnyder@nid00098:~/software/ssg/build> ssh nid00099
ssnyder@nid00099:~> ps aux | grep prte
ssnyder 57297 0.2 0.0 1347144 17192 ? Sl 12:42 0:00 /global/homes/s/ssnyder/software/pmix/prrte/install/bin/prted -pmca ess "slurm" -pmca ess_base_jobid "1267859456" -pmca ess_base_vpid "1" -pmca ess_base_num_procs "2" -pmca orte_hnp_uri "1267859456.0;tcp://10.128.0.99:37339"
Here's the corresponding bt for the case of the prun failure:
Program terminated with signal SIGABRT, Aborted.
#0 0x00002aaacca84160 in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x2aaad16ee700 (LWP 48889))]
(gdb) bt
#0 0x00002aaacca84160 in raise () from /lib64/libc.so.6
#1 0x00002aaacca85741 in abort () from /lib64/libc.so.6
#2 0x00002aaacca7c75a in __assert_fail_base () from /lib64/libc.so.6
#3 0x00002aaacca7c7d2 in __assert_fail () from /lib64/libc.so.6
#4 0x00002aaad08bffa6 in _pmix_list_append (list=0x2aaad4001bd0, item=0x100000bdfa0,
FILE_NAME=0x2aaad08d6790 "../../../../../src/mca/gds/hash/gds_hash.c", LINENO=383)
at /global/homes/s/ssnyder/software/pmix/pmix/src/class/pmix_list.h:564
#5 0x00002aaad08c314d in process_node_array (val=0x2aaad4003220, tgt=0x2aaad4001bd0)
at ../../../../../src/mca/gds/hash/gds_hash.c:383
#6 0x00002aaad08cf03b in hash_store_job_info (nspace=0x2aaad16eda40 "1233256450",
buf=0x2aaad16edcb0) at ../../../../../src/mca/gds/hash/gds_hash.c:1721
#7 0x00002aaaab2e2713 in wait_cbfunc (pr=0x100000a4440, hdr=0x100000bc044, buf=0x2aaad16edcb0,
cbdata=0x100000bd3c0) at ../../src/client/pmix_client_spawn.c:345
#8 0x00002aaaab3b6001 in pmix_ptl_base_process_msg (fd=-1, flags=4, cbdata=0x100000bbf70)
at ../../../../src/mca/ptl/base/ptl_base_sendrecv.c:807
#9 0x0000000020023a15 in event_process_active_single_queue (base=base@entry=0x100000a3d10,
activeq=0x100000a4160, max_to_process=max_to_process@entry=2147483647,
endtime=endtime@entry=0x0) at event.c:1646
#10 0x00000000200243ef in event_process_active (base=0x100000a3d10) at event.c:1738
#11 event_base_loop (base=0x100000a3d10, flags=<optimized out>) at event.c:1961
#12 0x00002aaaab33cf3e in progress_engine (obj=0x100000a3c80)
at ../../src/runtime/pmix_progress_threads.c:232
#13 0x00002aaacc837569 in start_thread () from /lib64/libpthread.so.0
#14 0x00002aaaccb46a2f in clone () from /lib64/libc.so.6
Any ideas on what could be happening? Are there any utilities I can run to sanity-check my server deployment (i.e., verify PMIx recognizes 2 servers on which processes can be invoked)? Is there a convenient way to get more verbose logging/reporting to see if there are any hints on what the issue is? FWIW, I get identical behavior on another cluster (Cooley system @ ALCF), but it is using rsh for the PLM, as that system uses the Cobalt scheduler rather than Slurm (and thus I have to explicitly provide a node list). Maybe I'm just not setting something up properly?
Thank you for taking the time to submit an issue!
Upgraded PMIx and PRRTE, code that was working now crashes
git master @ e37bfeb
ext
git master @ aeb383ba1ecb00515de85450abbb9e1d8e113dd8
git clone
During startup of my OpenSHMEM library I exchange various bits of info, e.g. for symmetric heap addresses/sizes. I think I am now seeing problems with that when using PRRTE. Launch via Open-MPI's mpirun still works. Errors are:
[cn092:134990] [[46042,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../psrvr-git/orte/orted/pmix/pmix_server_pub.c at line 583
[cn092:134990] [[46042,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../psrvr-git/orte/orted/pmix/pmix_server_pub.c at line 583
[cn092:134990] [[46042,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../psrvr-git/orte/orted/pmix/pmix_server_pub.c at line 583
[cn092:134990] [[46042,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../psrvr-git/orte/orted/pmix/pmix_server_pub.c at line 583
[cn092:134990] [[46042,0],1] errmgr:default_orted:proc_errors process [[46042,0],0] error state LIFELINE LOST
[cn092:134990] [[46042,0],1] errmgr:orted lifeline lost or unable to communicate - exiting
Another oddity is that this program is running on 2 nodes, 2 cores-per-node, but I am only seeing one host here for all 4 ranks.
Looks like there have been a number of changes to the PSRVR core code that reflect PMIx v3 support, thereby causing problems when built against an external v2.x, including:
[rhc001:242687] [[545,1],0] ORTE_ERROR_LOG: Not found in file base/ess_base_std_tool.c at line 311
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
store HNP URI failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[rhc001:242687] [[545,1],0] ORTE_ERROR_LOG: Not found in file ess_tool_module.c at line 130
and
[rhc001:242323] ptl:tcp: connecting to server
[rhc001:242323] ptl:tcp:tool searching for session server pmix.rhc001.tool
[rhc001:242323] pmix:tcp: searching directory /tmp
[rhc001:242323] pmix:tcp: ignoring .XIM-unix
[rhc001:242323] pmix:tcp: ignoring .X11-unix
[rhc001:242323] pmix:tcp: ignoring .font-unix
[rhc001:242323] pmix:tcp: ignoring .ICE-unix
[rhc001:242323] pmix:tcp: ignoring .Test-unix
[rhc001:242323] pmix:tcp: ignoring systemd-private-aba1df39ec8b4bc8a77564371fae743e-httpd.service-DlvHDw
[rhc001:242323] pmix:tcp: ignoring am4t8CPKnG
[rhc001:242323] pmix:tcp: ignoring am4tjvHRDk
[rhc001:242323] pmix:tcp: ignoring pmix.sys.rhc001
[rhc001:242323] pmix:tcp: ignoring systemd-private-aba1df39ec8b4bc8a77564371fae743e-rtkit-daemon.service-jIWT6h
[rhc001:242323] pmix:tcp: ignoring systemd-private-aba1df39ec8b4bc8a77564371fae743e-mariadb.service-S4EoAs
[rhc001:242323] pmix:tcp: ignoring .X0-lock
[rhc001:242323] pmix:tcp: ignoring arvjQrOH
[rhc001:242323] pmix:tcp: ignoring yum_save_tx.2018-01-18.03-46.NgOPvZ.yumtx
[rhc001:242323] pmix:tcp: ignoring ompi.rhc001.1000
[rhc001:242323] pmix:tcp: ignoring hsperfdata_root
[rhc001:242323] pmix:tcp: ignoring systemd-private-aba1df39ec8b4bc8a77564371fae743e-chronyd.service-Xo0Sr2
[rhc001:242323] pmix:tcp: ignoring systemd-private-aba1df39ec8b4bc8a77564371fae743e-cups.service-KOOi8o
[rhc001:242323] pmix:tcp: ignoring yum_save_tx.2018-01-17.03-56.4NynkC.yumtx
[rhc001:242323] pmix:tcp: ignoring yum_save_tx.2018-01-19.08-28.XaRItc.yumtx
[rhc001:242323] pmix:tcp: ignoring systemd-private-aba1df39ec8b4bc8a77564371fae743e-colord.service-cx3Kep
[rhc001:242323] OPAL ERROR: Unreachable in file ext2x_client.c at line 240
[rhc001:242323] [[840,0],0] ORTE_ERROR_LOG: Unreachable in file base/ess_base_std_tool.c at line 192
--------------------------------------------------------------------------
and finally, daemon wireup support is busted:
[rhc001:242103] [[108,0],0] ORTE_ERROR_LOG: Not found in file state_dvm.c at line 300
@rhc54 @jjhursey and @jsquyres diagnosed a race condition when a job fails to launch.
E.g.:
$ prte --daemonize
$ prun some_executable_that_emits_stderr_and_fails_immediately
This may hang, and may or may not produce output.
It looks like prun is still stuck in the PMIX spawn API call. Looking at prte --mca state_base_verbose 5, the remote daemons reported the termination correctly, but there appears to be a race where prun may not have completed PMIX spawn yet, and therefore somehow missed the termination notification.
@jjhursey said he'd have a look.
PMIX = v3.0.2
PRRTE = 7a34838
CentOS Linux release 7.2.1511 (Core)
Local network over ofi+sockets
As we attempt to switch from orterun to prun (along with updating to a more recent pmix compatible with prrte), we encounter an issue of node death notifications not being delivered.
When multiple servers/apps are started in a process group together and one of them dies/terminates, previously (using orterun and older pmix) we would receive a pmix notification about the death of the set member. We are no longer seeing the same behavior with the switch to pmix v3.0.2 and using prun.
Details:
2 sample servers are started on the same node using prun, as part of a set of size=2.
1 server kills itself; the other waits for a pmix notification of the death of the other member.
Sample test used by our project which uses PMIX apis for registration:
https://github.com/daos-stack/cart/blob/master/src/test/test_pmix.c
Prior to running the test we start prte as:
prte --daemonize -system-server -H "our_hostname:*"
The actual test is run as:
prun --continuous -N 2 -x D_LOG_MASK=INFO tests/test_pmix
The test currently times out after not seeing notification of a dead member.
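For context, a minimal sketch of the registration pattern the test relies on (illustrative only; the real code is in test_pmix.c linked above, and PMIX_ERR_PROC_ABORTED is an assumed choice of status code):
#include <stdio.h>
#include <pmix.h>

static void death_evhandler(size_t evhdlr_registration_id,
                            pmix_status_t status,
                            const pmix_proc_t *source,
                            pmix_info_t info[], size_t ninfo,
                            pmix_info_t results[], size_t nresults,
                            pmix_event_notification_cbfunc_fn_t cbfunc,
                            void *cbdata)
{
    /* the surviving member learns of the peer's death here */
    fprintf(stderr, "peer %s:%u terminated (status %d)\n",
            source->nspace, source->rank, status);
    /* a handler must always complete the event chain */
    if (NULL != cbfunc) {
        cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
    }
}

static void register_for_death_events(void)
{
    pmix_status_t code = PMIX_ERR_PROC_ABORTED; /* assumed status code */
    PMIx_Register_event_handler(&code, 1, NULL, 0,
                                death_evhandler, NULL, NULL);
}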
git prrte fc30acb (via latest master of open-mpi/ompi@960c5f7)
pmix4x (via latest ompi master)
Environment is not propagated to remote hosts during the SSH launch. This appears to be an issue with --enable-orterun-prefix-by-default at compile time, or when using --prefix at runtime.
Configured with VPATH (via ompi build)
./autogen.pl
cd BUILD-master/
../configure \
--enable-orterun-prefix-by-default \
--prefix=${OMPI_INSTALL_DIR} \
--enable-debug \
&& make -j 4 \
&& make install
Example to reproduce the problem:
[3t4@node0 BUILD-master]$ hostname
node0
[3t4@node0 BUILD-master]$ more hosts
node1
node2
[3t4@node0 BUILD-master]$ prte --hostfile hosts &
[1] 9107
[3t4@node0 BUILD-master]$ bash: prted: command not found
bash: prted: command not found
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/rml/oob/rml_oob_send.c at line 202
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../prrte/src/mca/plm/base/plm_base_launch_support.c at line 632
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/errmgr/dvm/errmgr_dvm.c at line 418
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/rml/oob/rml_oob_send.c at line 202
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/errmgr/dvm/errmgr_dvm.c at line 563
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/rml/oob/rml_oob_send.c at line 202
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../prrte/src/mca/plm/base/plm_base_launch_support.c at line 632
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/errmgr/dvm/errmgr_dvm.c at line 418
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/rml/oob/rml_oob_send.c at line 202
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../prrte/src/mca/plm/base/plm_base_launch_support.c at line 632
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/errmgr/dvm/errmgr_dvm.c at line 418
[1]+ Exit 127 prte --hostfile hosts
[3t4@node0 BUILD-master]$
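For context on what a working prefix launch should produce: the rsh/ssh launcher is expected to build a remote command that exports PATH and LD_LIBRARY_PATH from the prefix before invoking prted, roughly like the hedged sketch below (paths illustrative; the real command also carries the -pmca arguments visible in logs elsewhere in these reports). The "prted: command not found" above suggests the plain command is being sent instead.
ssh node1 "PATH=<prefix>/bin:$PATH ; export PATH ; \
           LD_LIBRARY_PATH=<prefix>/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; \
           <prefix>/bin/prted ..."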
Thank you for taking the time to submit an issue!
master branch (commit 891a7dd).
PMIx 3.1.2
When doing configure then make, the make command fails immediately with the following error:
CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/bash /home/mdorier/prrte/config/missing aclocal-1.15 -I config
aclocal-1.15: error: config/autogen_found_items.m4:180: file 'orte/mca/schizo/singularity/configure.m4' does not exist
Makefile:878: recipe for target 'aclocal.m4' failed
make: *** [aclocal.m4] Error 1
With multiple applications running back-to-back, the PRTE server crashes at a random point. On a few occasions, the server just got stuck and the launched application did not terminate. I am testing this with the Sandia OpenSHMEM unit tests by running "make check" after the server is launched. The issue occurs much less frequently when the PRTE server is launched and terminated for each application separately. Detailed configuration and outputs are below.
libevent 2.0.22-stable
hwloc 2.0.2
PMIx v3.1 (commit 7680895b0c5dec9b42206ddee35c80fb1683f6ca)
prte (PMIx Reference RTE) 3.0.0rc1
Here are the steps leading to this crash, assuming Sandia OpenSHMEM is downloaded and configured in sandia-shmem-basedir:
prte &
[1] 7328
DVM Ready
cd sandia-shmem-basedir
make check
Below is the last part of the output, collected while PRTE was run with the "-d" flag.
[node1:03550] [[18102,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_DVM_CLEANUP_JOB_CMD
[node1:03550] [[18102,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[node1:03550] [[18102,0],0] Releasing job data for [INVALID]
[node1:03550] [[18102,0],0] Releasing job data for [INVALID]
[node1:03550] sess_dir_finalize: proc session dir does not exist
[node1:03550] sess_dir_finalize: job session dir does not exist
[node1:03550] sess_dir_finalize: jobfam session dir not empty - leaving
[node1:03550] sess_dir_finalize: jobfam session dir not empty - leaving
[node1:03550] sess_dir_finalize: top session dir not empty - leaving
[node1:03550] sess_dir_finalize: proc session dir does not exist
[node1:03550] sess_dir_finalize: job session dir does not exist
[node1:03550] sess_dir_finalize: jobfam session dir not empty - leaving
[node1:03550] sess_dir_finalize: jobfam session dir not empty - leaving
[node1:03550] sess_dir_finalize: top session dir not empty - leaving
[node1:03550] [[18102,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_DVM_CLEANUP_JOB_CMD
[node1:03550] sess_dir_finalize: proc session dir does not exist
[node1:03550] sess_dir_finalize: job session dir does not exist
[node1:03550] sess_dir_finalize: jobfam session dir not empty - leaving
[node1:03550] sess_dir_finalize: jobfam session dir not empty - leaving
[node1:03550] sess_dir_finalize: top session dir not empty - leaving
[node1:03550] [[18102,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_DVM_CLEANUP_JOB_CMD
[node1:03550] [[18102,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[node1:03550] [[18102,0],0] Releasing job data for [INVALID]
[node1:03550] [[18102,0],0] Releasing job data for [INVALID]
[node1:03550] pmix_ptl_base: send_msg: write failed: Bad address (14) [sd = 25]
*** Error in `prte': munmap_chunk(): invalid pointer: 0x00007f24d0053a28 ***
======= Backtrace: =========
/usr/lib64/libc.so.6(+0x7ada4)[0x7f24ddbeada4]
/home/rahmanmd/prrte-install-trial/pmix-3.1/pmix-install/lib/libpmix.so.2(pmix_ptl_base_send_handler+0x3a5)[0x7f24df354a46]
/home/rahmanmd/prrte-install-trial/libevent-2.0.22-stable/libevent-install/lib/libevent-2.0.so.5(event_base_loop+0x812)[0x7f24dee2be82]
/home/rahmanmd/prrte-install-trial/pmix-3.1/pmix-install/lib/libpmix.so.2(+0x94931)[0x7f24df2fb931]
/usr/lib64/libpthread.so.0(+0x7dc5)[0x7f24ddf38dc5]
/usr/lib64/libc.so.6(clone+0x6d)[0x7f24ddc6773d]
Thank you for taking the time to submit an issue!
master@70b72b7ea9f1be771f962d1c3f205f3dab6bf529
master@089187ccb2ff2c10de09c8cc082ec76fadac897e
While trying to refresh my knowledge about PRRTE, I tried to run a simple test: on a single node, start prte and run a simple prun command to execute /bin/hostname.
Here are the commands I used:
$ $HOME/install/prrte_singularity/bin/prte --host localhost
On a different terminal:
$ $HOME/install/prrte_singularity/bin/prun /bin/hostname
which systematically gives me the following error:
[pessoa3:06117] PMIX ERROR: NOT-FOUND in file tool/pmix_tool.c at line 250
Am I doing something wrong?
Once in a while, prun -terminate hangs.
I can currently reproduce the issue with the latest PMIx v3.0 and my customized prrte from https://github.com/ggouaillardet/prrte/tree/topic/pmix2
Here are some traces:
(gdb) info threads
Id Target Id Frame
4 Thread 0x7fb7c1cd5700 (LWP 18785) "prun" 0x00007fb7c36886d3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
3 Thread 0x7fb7bfa3a700 (LWP 18786) "prun" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
2 Thread 0x7fb7bf239700 (LWP 18787) "prun" 0x00007fb7c367f913 in select () at ../sysdeps/unix/syscall-template.S:81
* 1 Thread 0x7fb7c4834740 (LWP 18784) "prun" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
(gdb) bt
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fb7c438e618 in PMIx_tool_finalize () at tool/pmix_tool.c:1143
#2 0x0000000000408b29 in prun (argc=2, argv=0x7ffcbbcbfd68) at prun.c:1229
#3 0x0000000000402e6d in main (argc=2, argv=0x7ffcbbcbfd68) at main.c:13
(gdb) p pmix_globals.connected
$1 = false
(gdb) f 1
#1 0x00007fb7c438e618 in PMIx_tool_finalize () at tool/pmix_tool.c:1143
1143 PMIX_WAIT_THREAD(&tev.lock);
(gdb) l
1138 }
1139 return rc;
1140 }
1141
1142 /* wait for the ack to return */
1143 PMIX_WAIT_THREAD(&tev.lock);
1144 PMIX_DESTRUCT_LOCK(&tev.lock);
1145 if (tev.active) {
1146 pmix_event_del(&tev.ev);
1147 }
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fb7bfa3a700 (LWP 18786))]
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135 ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fb7c395fdb0 in pthread_cond_broadcast@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S:136
#2 0x0000000000403f86 in evhandler (evhdlr_registration_id=0, status=-101, source=0x7fb7b800131c, info=0x7fb7b80017d0, ninfo=1, results=0x0, nresults=0,
cbfunc=0x7fb7c430aee8 <progress_local_event_hdlr>, cbdata=0x7fb7b8001240) at prun.c:346
#3 0x00007fb7c430d21e in pmix_invoke_local_event_hdlr (chain=0x7fb7b8001240) at event/pmix_event_notification.c:738
#4 0x00007fb7c430fcb6 in pmix_event_timeout_cb (fd=-1, flags=1, arg=0x7fb7b8001240) at event/pmix_event_notification.c:1143
#5 0x00007fb7c3b7ef24 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
#6 0x00007fb7c4384d85 in progress_engine (obj=0x1584cd8) at runtime/pmix_progress_threads.c:109
#7 0x00007fb7c395b184 in start_thread (arg=0x7fb7bfa3a700) at pthread_create.c:312
#8 0x00007fb7c368803d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) f 2
#2 0x0000000000403f86 in evhandler (evhdlr_registration_id=0, status=-101, source=0x7fb7b800131c, info=0x7fb7b80017d0, ninfo=1, results=0x0, nresults=0,
cbfunc=0x7fb7c430aee8 <progress_local_event_hdlr>, cbdata=0x7fb7b8001240) at prun.c:346
346 OPAL_PMIX_WAKEUP_THREAD(lock);
(gdb) l
341 lock->status = jobstatus;
342 if (NULL != msg) {
343 lock->msg = strdup(msg);
344 }
345 /* release the lock */
346 OPAL_PMIX_WAKEUP_THREAD(lock);
347
348 /* we _always_ have to execute the evhandler callback or
349 * else the event progress engine will hang */
350 if (NULL != cbfunc) {
I think the race occurs when PMIx_tool_finalize() is invoked while pmix_globals.connected is true but becomes false in the middle of it.
The surprising thing is that prun is notified about it (since evhandler() is invoked with status=PMIX_ERR_LOST_CONNECTION_TO_SERVER) and does not take any action, which is another story.
The odd thing is that PMIX_PTL_SEND_RECV did not invoke the finwait_cbfunc() callback at all.
All that being said, should we really care? I mean that, in the case of prun -terminate, could we simply exit(0) after the PMIx_Job_control_fn() callback is invoked, and not call PMIx_tool_finalize() at all?
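To make the proposal concrete, a hedged sketch of what that could look like (illustrative only, not prun code; info cleanup and error handling are omitted):
#include <stdbool.h>
#include <stdlib.h>
#include <pmix.h>
#include <pmix_tool.h>

static void terminate_cbfunc(pmix_status_t status,
                             pmix_info_t *info, size_t ninfo,
                             void *cbdata,
                             pmix_release_cbfunc_t release_fn,
                             void *release_cbdata)
{
    /* the server acknowledged the terminate request - just leave,
     * skipping PMIx_tool_finalize() entirely */
    exit(0);
}

static void request_terminate(void)
{
    bool flag = true;
    pmix_info_t directive;
    PMIX_INFO_LOAD(&directive, PMIX_JOB_CTRL_TERMINATE, &flag, PMIX_BOOL);
    PMIx_Job_control_nb(NULL, 0, &directive, 1, terminate_cbfunc, NULL);
}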
The first time you invoke a job that spawns another job, the IO from the child job is properly forwarded and output by the parent. However, if you invoke the job again, the IO from the child job is lost. It appears that something in the IOF gets confused and left in a "do not forward" state.
PRRTE master at a9ef1f5
PMIx master at openpmix/openpmix@a3cfa97
On job completion sometimes I see a message like the below repeated over and over again until the job is killed:
[node01:33166] Read -1, expected 1048576, errno = 14
[node01:33166] Read -1, expected 1048576, errno = 14
[node01:33166] Read -1, expected 1048576, errno = 14
I suspect that this is a race between the prte daemon closing the channel to the prun process and the prun process deregistering it from libevent.
It is a difficult timing window to hit in the wild, but regression testing in Open MPI via MTT hits this nightly due to the large volume of tests being run.
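For what it's worth, a hedged sketch of the general ordering that avoids this class of race (illustrative, not PRRTE code): deregister the read handler from libevent before the resources it references go away, so the loop can never fire a read on a dead channel.
#include <unistd.h>
#include <event2/event.h>

static void shutdown_read_side(struct event *rev, int fd)
{
    /* remove the pending read event before the fd/buffers it
     * references are torn down */
    event_del(rev);
    event_free(rev);
    close(fd);
}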
Thank you for taking the time to submit an issue!
Latest update to PRRTE causes prte crash.
git master @ 5c02d19
git master @ cf043ddce69cab9f874777bb2bef60e1ea9465d2
prte crashes as soon as a client contacts it.
$ prte
DVM ready
<wait for prun client>
[cn-mem:07946] *** Process received signal ***
[cn-mem:07946] Signal: Segmentation fault (11)
[cn-mem:07946] Signal code: Address not mapped (1)
[cn-mem:07946] Failing at address: (nil)
[cn-mem:07946] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2aaaacc05370]
[cn-mem:07946] [ 1] /lib64/libc.so.6(vsnprintf+0x62)[0x2aaaace86162]
[cn-mem:07946] [ 2] /lib64/libc.so.6(snprintf+0x82)[0x2aaaace63912]
[cn-mem:07946] [ 3] /gpfs/home/arcurtis/opt/prrte/git/lib/libprteopen-rte.so.0(orte_util_snprintf_jobid+0x19)[0x2aaaaacf7329]
[cn-mem:07946] [ 4] /gpfs/home/arcurtis/opt/prrte/git/lib/libprteopen-rte.so.0(orte_pmix_server_register_nspace+0x2c8a)[0x2aaaaad1553a]
[cn-mem:07946] [ 5] /gpfs/home/arcurtis/opt/prrte/git/lib/libprteopen-rte.so.0(orte_odls_base_default_construct_child_list+0x13ee)[0x2aaaaad3b35e]
[cn-mem:07946] [ 6] /gpfs/home/arcurtis/opt/prrte/git/lib/pmix/mca_odls_default.so(+0x253e)[0x2aaab3a1d53e]
[cn-mem:07946] [ 7] /gpfs/home/arcurtis/opt/prrte/git/lib/libprteopen-rte.so.0(orte_daemon_recv+0xabb)[0x2aaaaad06f6b]
[cn-mem:07946] [ 8] /gpfs/home/arcurtis/opt/prrte/git/lib/libprteopen-rte.so.0(orte_rml_base_process_msg+0x14b)[0x2aaaaad5e2eb]
[cn-mem:07946] [ 9] /gpfs/projects/ChapmanGroup/opt/libevent/lib/libevent-2.1.so.6(+0x2153d)[0x2aaaab90d53d]
[cn-mem:07946] [10] /gpfs/projects/ChapmanGroup/opt/libevent/lib/libevent-2.1.so.6(event_base_loop+0x3ef)[0x2aaaab90dc4f]
[cn-mem:07946] [11] prte[0x4028ad]
[cn-mem:07946] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaace33b35]
[cn-mem:07946] [13] prte[0x401b29]
[cn-mem:07946] *** End of error message ***
Segmentation fault (core dumped)
The order of processing for the loops that span objects vs procs on a node was reversed at some point. While the change is certainly faster (it put the larger loop on the outside of a smaller one), it unfortunately yields the wrong answer.
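A hedged toy illustration of why the nesting order matters (invented names, not PRRTE's mapper code): both orderings put the same number of procs on each object, but they assign ranks to different places.
/* 8 procs spread across 2 objects (e.g. packages) */
enum { NPROCS = 8, NOBJS = 2 };

/* procs in the outer loop, cycling objects: ranks alternate
 * 0,1,0,1,... across the objects (a round-robin spread) */
void map_procs_outer(int placement[NPROCS])
{
    for (int p = 0; p < NPROCS; p++) {
        placement[p] = p % NOBJS;
    }
}

/* objects in the outer loop, procs inner: ranks fill each object
 * in turn, 0,0,0,0,1,1,1,1 - same counts, different answer */
void map_objs_outer(int placement[NPROCS])
{
    int p = 0;
    for (int o = 0; o < NOBJS; o++) {
        for (int k = 0; k < NPROCS / NOBJS; k++) {
            placement[p++] = o;
        }
    }
}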
There are a number of things that need to be done to complete the OMPI integration effort. I'm going to list them here for tracking purposes and in the hope that others might pick some of them up. If you do, please edit this comment and put your name at the beginning of the item you are working on so we avoid duplicate effort. Obviously, there will be some "ompi" items in this list. This is a "living" list, so expect more things to be added as they are identified.
[@rhc54] Revise command line setup/parsing. Need to expand it a bit to allow for multiple command line definitions. Need to handle different MCA params for OMPI vs PRRTE.
Singleton support. IIRC, I enabled PMIx_Init to support singletons - i.e., when the client is not launched by a daemon and thus has no contact information for a PMIx server. However, I didn't do anything about the case of singleton comm_spawn where the client needs to start a PMIx server and then connect back to it.
Resolve reported comm_spawn issues. Multiple reports of comm_spawn problems on the OMPI mailing lists and issues. Includes missing support for various MPI_Info arguments such as "add_hostfile" that may (likely) require some updates to PRRTE
Decide what to do about legacy ORTE MCA params. These probably need to be detected and converted to their PRRTE equivalent
Update PRRTE frameworks to use MCA params solely for setting default behavior, overridden on a per-job basis by user specifications.
[@jsquyres] Come up with a way for "ompi_info" to include PRRTE information
Resolve multi-mpirun connect/accept issues - do we auto-detect the presence of another DVM and launch within it, or do we launch a 2nd DVM and "connect" between them, or...?
Devise support for user obtaining an MPI "port", printing it out, and then feeding it to another mpirun on the cmd line for connect/accept
Thank you for taking the time to submit an issue!
Install fails due to missing file
github master @ 2d99acb
github master @ d2473b0e641709bb8395823d07d39726543486ae
$ make install
...
...
...
make[3]: Entering directory '/home/arcurtis/src/prrte/build/src/etc'
make[3]: Nothing to be done for 'install-exec-am'.
/usr/bin/mkdir -p /home/arcurtis/opt/prrte/git/etc
******************************* WARNING ************************************
*** Not installing new prrte-mca-params.conf over existing file in:
*** /home/arcurtis/opt/prrte/git/etc/prrte-mca-params.conf
******************************* WARNING ************************************
/usr/bin/install: cannot stat 'prrte-default-hostfile': No such file or directory
make[3]: *** [Makefile:866: install-data-local] Error 1
make[3]: Leaving directory '/home/arcurtis/src/prrte/build/src/etc'
make[2]: *** [Makefile:749: install-am] Error 2
make[2]: Leaving directory '/home/arcurtis/src/prrte/build/src/etc'
make[1]: *** [Makefile:1862: install-recursive] Error 1
make[1]: Leaving directory '/home/arcurtis/src/prrte/build/src'
make: *** [Makefile:843: install-recursive] Error 1
Thank you for taking the time to submit an issue!
Latest git HEAD master doesn't compile when Torque is found (as in --with-tm).
git HEAD master c03469c
git HEAD master f51832d50f1c69c38f14c968167939cb00ad9482
Compile error when Torque is detected (not actually using it, but it is still present in the environment). Compilation succeeds with --without-tm.
make[2]: Entering directory `/gpfs/projects/ChapmanGroup/src/prrte/build/src/mca/plm/tm'
CC plm_tm_component.lo
CC plm_tm_module.lo
../../../../../prrte-git/src/mca/plm/tm/plm_tm_component.c:34:39: fatal error: src/mca/base/mca_base_var.h: No such file or directory
#include "src/mca/base/mca_base_var.h"
^
compilation terminated.
../../../../../prrte-git/src/mca/plm/tm/plm_tm_module.c: In function 'launch_daemons':
../../../../../prrte-git/src/mca/plm/tm/plm_tm_module.c:297:5: warning: implicit declaration of function 'mca_base_var_env_name' [-Wimplicit-function-declaration]
(void) mca_base_var_env_name ("plm", &var);
^
make[2]: *** [plm_tm_component.lo] Error 1
Thank you for taking the time to submit an issue!
Just installed latest PMIx update from github, prrte now segfaults.
(Wes [wessle] is working with me, BTW, this is all related)
git master @ ffe3dd3
External
git master @ a1d3610c2b0eadf68948eead2ec64fc29d799a9e
git clone
Run prrte to get a DVM, then
$ prun -n 1 pmix-client-program
generates this from prrte:
(gdb) r
Starting program: /opt/prrte/bin/prte
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff1c07700 (LWP 52013)]
[New Thread 0x7ffff0dee700 (LWP 52014)]
[New Thread 0x7fffefbcb700 (LWP 52015)]
[New Thread 0x7fffef3ca700 (LWP 52016)]
DVM ready
Thread 2 "prte" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff1c07700 (LWP 52013)]
query_cbfunc (status=<optimized out>, status@entry=0, info=info@entry=0x0,
ninfo=<optimized out>, ninfo@entry=0, cbdata=0x7fffe800db70,
release_fn=release_fn@entry=0x0, release_cbdata=release_cbdata@entry=0x0)
at ../../pmix-git/src/server/pmix_server.c:2748
2748 ../../pmix-git/src/server/pmix_server.c: No such file or directory.
(gdb) bt
#0 query_cbfunc (status=<optimized out>, status@entry=0, info=info@entry=0x0,
ninfo=<optimized out>, ninfo@entry=0, cbdata=0x7fffe800db70,
release_fn=release_fn@entry=0x0, release_cbdata=release_cbdata@entry=0x0)
at ../../pmix-git/src/server/pmix_server.c:2748
#1 0x00007ffff741d3ce in pmix_server_job_ctrl (
peer=peer@entry=0x7fffe800e9c0, buf=buf@entry=0x7ffff1c06c80,
cbfunc=cbfunc@entry=0x7ffff73fd1a0 <query_cbfunc>, cbdata=<optimized out>)
at ../../pmix-git/src/server/pmix_server_ops.c:2541
#2 0x00007ffff7402f4a in server_switchyard (peer=peer@entry=0x7fffe800e9c0,
tag=101, buf=buf@entry=0x7ffff1c06c80)
at ../../pmix-git/src/server/pmix_server.c:3196
#3 0x00007ffff7403897 in pmix_server_message_handler (pr=0x7fffe800e9c0,
hdr=0x7fffe800da08, buf=0x7ffff1c06c80, cbdata=<optimized out>)
at ../../pmix-git/src/server/pmix_server.c:3246
#4 0x00007ffff746b1be in pmix_ptl_base_process_msg (fd=<optimized out>,
flags=<optimized out>, cbdata=0x7fffe800d930)
at ../../../../pmix-git/src/mca/ptl/base/ptl_base_sendrecv.c:719
#5 0x00007ffff6f74345 in event_process_active_single_queue (
base=base@entry=0x6ca840, activeq=0x6cac90,
max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0)
at event.c:1646
#6 0x00007ffff6f74d47 in event_process_active (base=0x6ca840) at event.c:1738
#7 event_base_loop (base=0x6ca840, flags=flags@entry=1) at event.c:1961
#8 0x00007ffff7427fde in progress_engine (obj=<optimized out>)
at ../../pmix-git/src/runtime/pmix_progress_threads.c:109
#9 0x00007ffff634c594 in start_thread (arg=<optimized out>)
at pthread_create.c:463
#10 0x00007ffff60800df in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)
Thank you for taking the time to submit an issue!
Trying to trap exit codes from an app launched via prun (and prte).
git master @ 093174a070fbcf76bb00df78c7d4f4b0ef3c4da8
External
git master @ 35ae8a0320c32023fb1a2ab087352d07e888bfc9
source
If the app calls exit(1) or equivalent, can I get this back from prun? prun always seems to exit(0).
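A minimal check, hedged (/bin/false stands in for any app that exits non-zero; the 0 is what I observe today, where a non-zero status is what I would expect):
shell$ prun -n 1 /bin/false ; echo $?
0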
prrte version: master @ 1aec6c4
external
pmix : master @ be15631db82cf9b3fd5078f1336812de0b500838
git clone
When I use pcc to compile the ompi/examples, it is not using the external pmix.h.
With pcc --show-me, the -I/external_pmix_install_path/include flag is missing.
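A quick, hedged way to inspect the wrapper's include flags (standard shell filtering of the --show-me output):
shell$ pcc --show-me | tr ' ' '\n' | grep '^-I'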
With latest prte and pmix from their master branches:
I incorrectly passed the -c flag to pcc and hence generated an object file when I really meant to generate a binary. Then I tried to prun it and ended up with an incomprehensible error message:
$ ~/local/prrte/bin/pcc -c -g -O0 -o hello hello.c
$ ~/local/prrte/bin/prun -n 1 ./hello
[c7:31224] Job failed to spawn: ERROR STRING NOT FOUND
The bash error is "Permission denied" (since the executable bit is not set on an object file), and I would expect a similar error to be reported by prun.
Thank you for taking the time to submit an issue!
After a git pull and new install of PRRTE today, "prte" now exits immediately on our local cluster, both outside and inside of PBS jobs. It continues to work OK on 2 other standalone machines.
git master @ e37bfeb
external
git master @ aeb383ba1ecb00515de85450abbb9e1d8e113dd8
git clone [email protected]:pmix/pmix.git
Running "prte" exits immediately with status -43, instead of waiting at "DVM ready". Debugging output below
$ (prte -d --pmca ess_base_verbose 1000 ; echo $?) |& cat -n
1 [login:01545] mca: base: components_register: registering framework ess components
2 [login:01545] mca: base: components_register: found loaded component tm
3 [login:01545] mca: base: components_register: component tm has no register or open function
4 [login:01545] mca: base: components_register: found loaded component env
5 [login:01545] mca: base: components_register: component env has no register or open function
6 [login:01545] mca: base: components_register: found loaded component hnp
7 [login:01545] mca: base: components_register: component hnp has no register or open function
8 [login:01545] mca: base: components_register: found loaded component slurm
9 [login:01545] mca: base: components_register: component slurm has no register or open function
10 [login:01545] mca: base: components_open: opening ess components
11 [login:01545] mca: base: components_open: found loaded component tm
12 [login:01545] mca: base: components_open: component tm open function successful
13 [login:01545] mca: base: components_open: found loaded component env
14 [login:01545] mca: base: components_open: component env open function successful
15 [login:01545] mca: base: components_open: found loaded component hnp
16 [login:01545] mca: base: components_open: component hnp open function successful
17 [login:01545] mca: base: components_open: found loaded component slurm
18 [login:01545] mca: base: components_open: component slurm open function successful
19 [login:01545] mca:base:select: Auto-selecting ess components
20 [login:01545] mca:base:select:( ess) Querying component [tm]
21 [login:01545] mca:base:select:( ess) Querying component [env]
22 [login:01545] mca:base:select:( ess) Querying component [hnp]
23 [login:01545] mca:base:select:( ess) Query of component [hnp] set priority to 100
24 [login:01545] mca:base:select:( ess) Querying component [slurm]
25 [login:01545] mca:base:select:( ess) Selected component [hnp]
26 [login:01545] mca: base: close: component tm closed
27 [login:01545] mca: base: close: unloading component tm
28 [login:01545] mca: base: close: component env closed
29 [login:01545] mca: base: close: unloading component env
30 [login:01545] mca: base: close: component slurm closed
31 [login:01545] mca: base: close: unloading component slurm
32 [login:01545] procdir: /tmp/ompi.login.170008941/dvm/0/0
33 [login:01545] jobdir: /tmp/ompi.login.170008941/dvm/0
34 [login:01545] top: /tmp/ompi.login.170008941/dvm
35 [login:01545] top: /tmp/ompi.login.170008941
36 [login:01545] tmp: /tmp
37 [login:01545] sess_dir_cleanup: job session dir does not exist
38 [login:01545] sess_dir_cleanup: top session dir does not exist
39 [login:01545] procdir: /tmp/ompi.login.170008941/dvm/0/0
40 [login:01545] jobdir: /tmp/ompi.login.170008941/dvm/0
41 [login:01545] top: /tmp/ompi.login.170008941/dvm
42 [login:01545] top: /tmp/ompi.login.170008941
43 [login:01545] tmp: /tmp
44 [login:01545] sess_dir_finalize: proc session dir does not exist
45 [login:01545] sess_dir_finalize: job session dir does not exist
46 [login:01545] sess_dir_finalize: jobfam session dir not empty - leaving
47 [login:01545] sess_dir_finalize: jobfam session dir not empty - leaving
48 [login:01545] sess_dir_finalize: top session dir not empty - leaving
49 [login:01545] sess_dir_cleanup: job session dir does not exist
50 [login:01545] sess_dir_cleanup: found top session dir empty - deleting
51 213
I was expecting system info and then "DVM ready" after line 43. I am guessing the problem is in the HNP; how can I dig deeper?
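For anyone wanting to dig alongside: raising verbosity on the state and launch machinery is the obvious next step (assuming these MCA verbose levels behave like the ess one above):
shell$ prte -d --pmca ess_base_verbose 1000 --pmca state_base_verbose 10 --pmca plm_base_verbose 10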
Just an issue for tracking the stabilization effort