
prrte's Introduction

PMIx Reference RunTime Environment (PRRTE)

PRRTE is the PMIx Reference RunTime Environment

Official documentation

The PRRTE documentation can be viewed in the following ways:

  1. Online at https://docs.prrte.org/
  2. In self-contained form (i.e., suitable for local viewing without an internet connection) in official distribution tarballs, under docs/_build/html/index.html.

Building the documentation locally

The source code for PRRTE's docs can be found in the PRRTE Git repository under the docs folder.

Developers who clone the PRRTE Git repository will not have the HTML documentation and man pages by default; they must be built. Instructions for building the PRRTE documentation can be found here: https://docs.prrte.org/en/latest/developers/sphinx.html

prrte's People

Contributors

abouteiller, alex-mikheev, artpol84, bosilca, bwbarrett, ddaniel, edgargabriel, ggouaillardet, goodell, gshipman, hjelmn, hpcraink, hppritcha, igor-ivanov, jjhursey, jladd-mlnx, jsquyres, jurenz, kawashima-fj, mike-dubman, nysal, rhc54, rlgraham32, rolfv, samuelkgutierrez, timattox, tkordenbrock, vvenkates27, yburette, yosefe


prrte's Issues

prte crashes immediately since recent PRRTE update

Thank you for taking the time to submit an issue!

Background information

prte crashes on startup

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

PRRTE git master @ 891a7dd

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

PMIx git master @ 257f6b4c9aced263824a4273996678985bea5d0d

Please describe the system on which you are running

  • Operating system/version: CentOS 7
  • Computer hardware: x86_64
  • Network type: IB

This is happening on other machines/platforms too.


Details of the problem

shell$ prte
[login2:57281] *** Process received signal ***
[login2:57281] Signal: Segmentation fault (11)
[login2:57281] Signal code: Address not mapped (1)
[login2:57281] Failing at address: 0x30
[login2:57281] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x2aaaacd256d0]
[login2:57281] [ 1] /gpfs/projects/ChapmanGroup/opt/pmix/git/lib/pmix/mca_pnet_tcp.so(+0x2403)[0x2aaab1d98403]
[login2:57281] [ 2] /gpfs/projects/ChapmanGroup/opt/pmix/git/lib/libpmix.so.0(pmix_pnet_base_select+0xd0)[0x2aaaab33d8e0]
[login2:57281] [ 3] /gpfs/projects/ChapmanGroup/opt/pmix/git/lib/libpmix.so.0(PMIx_server_init+0x741)[0x2aaaab2c2151]
[login2:57281] [ 4] /gpfs/projects/ChapmanGroup/opt/prrte/git/lib/libprteopen-rte.so.0(pmix_server_init+0x8ba)[0x2aaaaad1ef3a]
[login2:57281] [ 5] /gpfs/projects/ChapmanGroup/opt/prrte/git/lib/pmix/mca_ess_hnp.so(+0x4700)[0x2aaaae142700]
[login2:57281] [ 6] /gpfs/projects/ChapmanGroup/opt/prrte/git/lib/libprteopen-rte.so.0(orte_init+0x2c6)[0x2aaaaace51a6]
[login2:57281] [ 7] prte[0x4024ee]
[login2:57281] [ 8] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaacf54445]
[login2:57281] [ 9] prte[0x401d19]
[login2:57281] *** End of error message ***
Segmentation fault (core dumped)

gdb says:

shell$ gdb `which prte`
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /gpfs/projects/ChapmanGroup/opt/prrte/git/bin/prte...done.
(gdb) r
Starting program: /gpfs/projects/ChapmanGroup/opt/prrte/git/bin/prte
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: File "/gpfs/projects/ChapmanGroup/opt/gcc/git/lib64/libstdc++.so.6.0.26-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
To enable execution of this file add
	add-auto-load-safe-path /gpfs/projects/ChapmanGroup/opt/gcc/git/lib64/libstdc++.so.6.0.26-gdb.py
line to your configuration file "/gpfs/home/arcurtis/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/gpfs/home/arcurtis/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
[New Thread 0x2aaaae13d700 (LWP 66951)]

Program received signal SIGSEGV, Segmentation fault.
0x00002aaab1d98403 in tcp_finalize ()
   from /gpfs/projects/ChapmanGroup/opt/pmix/git/lib/pmix/mca_pnet_tcp.so
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.170-4.el7.x86_64 elfutils-libs-0.170-4.el7.x86_64 glibc-2.17-222.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-9.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 openssl-libs-1.0.2k-12.el7.x86_64 systemd-libs-219-57.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00002aaab1d98403 in tcp_finalize ()
   from /gpfs/projects/ChapmanGroup/opt/pmix/git/lib/pmix/mca_pnet_tcp.so
#1  0x00002aaaab33d8e0 in pmix_pnet_base_select ()
   from /gpfs/projects/ChapmanGroup/opt/pmix/git/lib/libpmix.so.0
#2  0x00002aaaab2c2151 in PMIx_server_init ()
   from /gpfs/projects/ChapmanGroup/opt/pmix/git/lib/libpmix.so.0
#3  0x00002aaaaad1ef3a in pmix_server_init ()
   from /gpfs/projects/ChapmanGroup/opt/prrte/git/lib/libprteopen-rte.so.0
#4  0x00002aaaae142700 in rte_init ()
   from /gpfs/projects/ChapmanGroup/opt/prrte/git/lib/pmix/mca_ess_hnp.so
#5  0x00002aaaaace51a6 in orte_init ()
   from /gpfs/projects/ChapmanGroup/opt/prrte/git/lib/libprteopen-rte.so.0
#6  0x00000000004024ee in main (argc=1, argv=0x7fffffffaf48)
    at ../../../../prrte-git/orte/tools/prte/prte.c:369

prun: ensure non-zero exit code if fail MAPPING

Background information

  • PRRTE master @ e93c77c
  • PMIX tested using pmix-3.1.3
  • Test machine:
    • Operating system/version: Linux ubuntu 16.04
    • Computer hardware: x86-64
    • Network type: ethernet

Details of the problem

When a mapping fails, the return value from prun is set to success (0) instead of an abnormal termination value (non-zero).

This can be reproduced like this:

 # Ask for more procs than you have slots
shell$ prun -np 2 -host localhost ./hello_world ; echo $?

ras:lsf compile errors with master

Thank you for taking the time to submit an issue!

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

master @ 2a0539a

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

pmix-3.1.4

Please describe the system on which you are running

  • Operating system/version: Linux
  • Computer hardware: POWER9 (Summit)
  • Network type: MLNX

Details of the problem

Compile error in ras:lsf component when building on Summit.

make[2]: Entering directory `/autofs/nccs-svm1_sw/summit/ums/ompix/DEVELOP/gcc/6.4.0/build/prrte-br-master/orte/mca/ras/lsf'
  CC       ras_lsf_module.lo
In file included from ../../../../../../../../source/prrte-br-master/orte/mca/ras/lsf/ras_lsf_module.c:37:0:
../../../../../../../../source/prrte-br-master/orte/mca/ras/lsf/ras_lsf_module.c: In function 'allocate':
../../../../../../../../source/prrte-br-master/orte/mca/ras/lsf/ras_lsf_module.c:122:65: error: 'orte_rmaps_base' undeclared (first use in this function)
     } else if ((ORTE_MAPPING_GIVEN & ORTE_GET_MAPPING_DIRECTIVE(orte_rmaps_base.mapping)) ||
                                                                 ^
../../../../../../../../source/prrte-br-master/orte/mca/rmaps/rmaps_types.h:103:7: note: in definition of macro 'ORTE_GET_MAPPING_DIRECTIVE'
     ((pol) & 0xff00)
       ^~~
../../../../../../../../source/prrte-br-master/orte/mca/ras/lsf/ras_lsf_module.c:122:65: note: each undeclared identifier is reported only once for each function it appears in
     } else if ((ORTE_MAPPING_GIVEN & ORTE_GET_MAPPING_DIRECTIVE(orte_rmaps_base.mapping)) ||
                                                                 ^
../../../../../../../../source/prrte-br-master/orte/mca/rmaps/rmaps_types.h:103:7: note: in definition of macro 'ORTE_GET_MAPPING_DIRECTIVE'
     ((pol) & 0xff00)
       ^~~
../../../../../../../../source/prrte-br-master/orte/mca/ras/lsf/ras_lsf_module.c:123:70: error: expected ')' before '{' token
                OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy) {
                                                                      ^
../../../../../../../../source/prrte-br-master/orte/mca/ras/lsf/ras_lsf_module.c:174:1: error: expected expression before '}' token
 }
 ^
make[2]: *** [ras_lsf_module.lo] Error 1
make[2]: Leaving directory `/autofs/nccs-svm1_sw/summit/ums/ompix/DEVELOP/gcc/6.4.0/build/prrte-br-master/orte/mca/ras/lsf'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/autofs/nccs-svm1_sw/summit/ums/ompix/DEVELOP/gcc/6.4.0/build/prrte-br-master/orte'
make: *** [all-recursive] Error 1

getenv() now always returns NULL in client code

Thank you for taking the time to submit an issue!

Background information

I use getenv() in my OpenSHMEM library, which launches through PMIx. Now, with PRRTE as the launcher, getenv() always returns NULL despite the environment variables being set.

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

git master @ 164ab7f

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

git master @ aa6fb1e3b2b5a340c427960b02d06c6ffa01bdc4

Please describe the system on which you are running

  • Operating system/version: CentOS 7.5
  • Computer hardware: x86_64
  • Network type: Ib

Details of the problem

MWE below. When launched through prte/prun, getenv() returns NULL despite VERBOSITY=1 being in the environment. With Open MPI as the launcher, it picks up the string for "VERBOSITY".

#include <stdio.h>
#include <stdlib.h>

#include <pmix.h>

int
main()
{
    pmix_proc_t p;

    PMIx_Init(&p, NULL, 0);

    char *v = getenv("VERBOSITY");

    printf("%d: v = %p\n", p.rank, v);

    PMIx_Finalize(NULL, 0);

    return 0;
}

PRTE git HEAD master exiting immediately on one platform

Thank you for taking the time to submit an issue!

Background information

After a git pull and new install of PRRTE today, "prte" now exits immediately on our local cluster, both outside and inside PBS jobs. It continues to work fine on 2 other standalone machines.

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

git master @ e37bfeb

Is the reference server using its internal version of PMIx, or an external one?

external

If external, what version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

git master @ aeb383ba1ecb00515de85450abbb9e1d8e113dd8

Describe how PMIx was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone [email protected]:pmix/pmix.git

Please describe the system on which you are running

  • Operating system/version: CentOS 7
  • Computer hardware: Penguin, Xeon E5-2650
  • Network type: ethernet, infiniband (some mlx4, some mlx5)

Details of the problem

Running "prte" exits immediately with status -43, instead of waiting at "DVM ready". Debugging output below

$ (prte -d --pmca ess_base_verbose 1000 ; echo $?) |& cat -n
     1	[login:01545] mca: base: components_register: registering framework ess components
     2	[login:01545] mca: base: components_register: found loaded component tm
     3	[login:01545] mca: base: components_register: component tm has no register or open function
     4	[login:01545] mca: base: components_register: found loaded component env
     5	[login:01545] mca: base: components_register: component env has no register or open function
     6	[login:01545] mca: base: components_register: found loaded component hnp
     7	[login:01545] mca: base: components_register: component hnp has no register or open function
     8	[login:01545] mca: base: components_register: found loaded component slurm
     9	[login:01545] mca: base: components_register: component slurm has no register or open function
    10	[login:01545] mca: base: components_open: opening ess components
    11	[login:01545] mca: base: components_open: found loaded component tm
    12	[login:01545] mca: base: components_open: component tm open function successful
    13	[login:01545] mca: base: components_open: found loaded component env
    14	[login:01545] mca: base: components_open: component env open function successful
    15	[login:01545] mca: base: components_open: found loaded component hnp
    16	[login:01545] mca: base: components_open: component hnp open function successful
    17	[login:01545] mca: base: components_open: found loaded component slurm
    18	[login:01545] mca: base: components_open: component slurm open function successful
    19	[login:01545] mca:base:select: Auto-selecting ess components
    20	[login:01545] mca:base:select:(  ess) Querying component [tm]
    21	[login:01545] mca:base:select:(  ess) Querying component [env]
    22	[login:01545] mca:base:select:(  ess) Querying component [hnp]
    23	[login:01545] mca:base:select:(  ess) Query of component [hnp] set priority to 100
    24	[login:01545] mca:base:select:(  ess) Querying component [slurm]
    25	[login:01545] mca:base:select:(  ess) Selected component [hnp]
    26	[login:01545] mca: base: close: component tm closed
    27	[login:01545] mca: base: close: unloading component tm
    28	[login:01545] mca: base: close: component env closed
    29	[login:01545] mca: base: close: unloading component env
    30	[login:01545] mca: base: close: component slurm closed
    31	[login:01545] mca: base: close: unloading component slurm
    32	[login:01545] procdir: /tmp/ompi.login.170008941/dvm/0/0
    33	[login:01545] jobdir: /tmp/ompi.login.170008941/dvm/0
    34	[login:01545] top: /tmp/ompi.login.170008941/dvm
    35	[login:01545] top: /tmp/ompi.login.170008941
    36	[login:01545] tmp: /tmp
    37	[login:01545] sess_dir_cleanup: job session dir does not exist
    38	[login:01545] sess_dir_cleanup: top session dir does not exist
    39	[login:01545] procdir: /tmp/ompi.login.170008941/dvm/0/0
    40	[login:01545] jobdir: /tmp/ompi.login.170008941/dvm/0
    41	[login:01545] top: /tmp/ompi.login.170008941/dvm
    42	[login:01545] top: /tmp/ompi.login.170008941
    43	[login:01545] tmp: /tmp
    44	[login:01545] sess_dir_finalize: proc session dir does not exist
    45	[login:01545] sess_dir_finalize: job session dir does not exist
    46	[login:01545] sess_dir_finalize: jobfam session dir not empty - leaving
    47	[login:01545] sess_dir_finalize: jobfam session dir not empty - leaving
    48	[login:01545] sess_dir_finalize: top session dir not empty - leaving
    49	[login:01545] sess_dir_cleanup: job session dir does not exist
    50	[login:01545] sess_dir_cleanup: found top session dir empty - deleting
    51	213

I was expecting system info and then "DVM ready" after line 43 of the output. I'm guessing the problem is in the HNP; how can I dig deeper?

Singletons?

Currently, the only way for an application process to connect to a PMIx server is by being spawned by that server - only tools have the logic to "discover" a PMIx server. This raises the question for PRRTE: how do we support singleton operations?

One possibility is to modify the client code to match that of a tool - i.e., if not given contact info, then search for it. However, this does raise some security issues that we deal with for tools, but not necessarily for apps. It also begins to blur the distinction between the two categories.

Another option would be to have the singleton spin off its own "prun" to support it, as ORTE did - but that always left a sour taste in my mouth.

Any thoughts? My personal leaning would be to allow singletons to self-discover the local server, but to identify themselves as an app instead of a tool to make their intent clear for future places where we might want to differentiate them. For example, we allow a tool to drop a rendezvous file for subsequent attachment, but we don't provide that ability to an app.

@jjhursey @ggouaillardet ?
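
For concreteness, a minimal sketch of what the first option (the client falling back to tool-style discovery of the local server) might look like. This is illustrative only and simply reuses the existing PMIx_Init/PMIx_tool_init entry points; it is not an actual PRRTE/PMIx implementation:

#include <stdio.h>
#include <stdbool.h>
#include <pmix.h>
#include <pmix_tool.h>

/* Illustrative singleton startup: try the normal client path first; if no
 * launcher-provided contact info exists, fall back to the tool-style
 * self-discovery path. Not PRRTE/PMIx implementation code. */
int main(void)
{
    pmix_proc_t me;
    bool as_tool = false;

    if (PMIX_SUCCESS != PMIx_Init(&me, NULL, 0)) {
        /* No server contact info in the environment: discover the local
         * server the way a tool does. A real implementation would also
         * identify itself as an app rather than a tool, per the
         * discussion above. */
        if (PMIX_SUCCESS != PMIx_tool_init(&me, NULL, 0)) {
            fprintf(stderr, "no local PMIx server found\n");
            return 1;
        }
        as_tool = true;
    }

    printf("connected as %s:%u\n", me.nspace, me.rank);

    if (as_tool) {
        PMIx_tool_finalize();
    } else {
        PMIx_Finalize(NULL, 0);
    }
    return 0;
}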

Compile error when Torque/PBS detected

Thank you for taking the time to submit an issue!

Background information

The latest git HEAD master doesn't compile when Torque is found (as in --with-tm).

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

git HEAD master c03469c

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

git HEAD master f51832d50f1c69c38f14c968167939cb00ad9482

Please describe the system on which you are running

  • Operating system/version: CentOS 7
  • Computer hardware: x86_64
  • Network type: IB

Details of the problem

Compile error when Torque is detected (we are not actually using it, but it is still present in the environment). Compilation is successful with --without-tm.

make[2]: Entering directory `/gpfs/projects/ChapmanGroup/src/prrte/build/src/mca/plm/tm'
  CC       plm_tm_component.lo
  CC       plm_tm_module.lo
../../../../../prrte-git/src/mca/plm/tm/plm_tm_component.c:34:39: fatal error: src/mca/base/mca_base_var.h: No such file or directory
 #include "src/mca/base/mca_base_var.h"
                                       ^
compilation terminated.
../../../../../prrte-git/src/mca/plm/tm/plm_tm_module.c: In function 'launch_daemons':
../../../../../prrte-git/src/mca/plm/tm/plm_tm_module.c:297:5: warning: implicit declaration of function 'mca_base_var_env_name' [-Wimplicit-function-declaration]
     (void) mca_base_var_env_name ("plm", &var);
     ^
make[2]: *** [plm_tm_component.lo] Error 1

Node death notification events not delivered with PMIX v3.0.2

Background information

PMIX = v3.0.2
PRRTE = 7a34838

Please describe the system on which you are running

CentOS Linux release 7.2.1511 (Core)
Local network over ofi+sockets


Details of the problem

As we attempt to switch from orterun to prun (along with updating to a more recent PMIx compatible with PRRTE), we are encountering an issue of node death notifications not being delivered.

When multiple servers/apps are started in a process group together and one of them dies/terminates, we previously (using orterun and an older PMIx) would receive a PMIx notification about the death of the set member. We are no longer seeing the same behavior after switching to PMIx v3.0.2 and using prun.

Details:
2 sample servers are started on the same node, using prun to start them as part of a set of size=2.
1 server kills itself; the other waits for a PMIx notification of the death of the other member.

Sample test used by our project, which uses the PMIx APIs for registration:
https://github.com/daos-stack/cart/blob/master/src/test/test_pmix.c

Prior to running the test we start prte as:
prte --daemonize -system-server -H "our_hostname:*"

The actual test is run as:
prun --continuous -N 2 -x D_LOG_MASK=INFO tests/test_pmix

The test currently times out without seeing a notification of the dead member.
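
For reference, a minimal sketch of the kind of registration the test relies on. It assumes the event arrives as PMIX_ERR_PROC_ABORTED; the exact status code PRRTE delivers for a dead set member is an assumption here, and the linked test's actual registration may differ:

#include <stdio.h>
#include <pmix.h>

/* Called by the PMIx progress thread when a registered event fires. */
static void death_cb(size_t evhdlr_registration_id, pmix_status_t status,
                     const pmix_proc_t *source,
                     pmix_info_t info[], size_t ninfo,
                     pmix_info_t results[], size_t nresults,
                     pmix_event_notification_cbfunc_fn_t cbfunc, void *cbdata)
{
    fprintf(stderr, "got event %s from %s:%u\n",
            PMIx_Error_string(status), source->nspace, source->rank);
    /* Always complete the chain or the event progress engine will hang. */
    if (NULL != cbfunc) {
        cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
    }
}

int main(void)
{
    pmix_proc_t me;
    pmix_status_t code = PMIX_ERR_PROC_ABORTED;  /* assumed status code */

    if (PMIX_SUCCESS != PMIx_Init(&me, NULL, 0)) {
        return 1;
    }
    /* NULL info/cbfunc keeps the sketch short. */
    PMIx_Register_event_handler(&code, 1, NULL, 0, death_cb, NULL, NULL);

    /* ... wait for the notification, e.g. on a condition variable ... */

    PMIx_Finalize(NULL, 0);
    return 0;
}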

prte never says "DVM ready" inside SLURM

Thank you for taking the time to submit an issue!

Background information

prte never says "DVM ready" on compute nodes in SLURM cluster

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

git master @ d31f0db

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

git master @ 7962c62d4eeaadfe8411df2e058c8b909fbf529d

(and 3.1.4 release)

Please describe the system on which you are running

  • Operating system/version: CentOS 7.5
  • Computer hardware: x86_64
  • Network type: IB

Details of the problem

prte launched on a SLURM compute node never says "DVM ready".

On a login/bare node, I get "DVM ready" immediately.

Let me know what debugging info to provide.

prun --terminate hangs sometimes

Once in a while, prun --terminate hangs.
I can currently reproduce the issue with the latest PMIx v3.0 and my customized PRRTE from https://github.com/ggouaillardet/prrte/tree/topic/pmix2

Here are some traces

(gdb) info threads
  Id   Target Id         Frame 
  4    Thread 0x7fb7c1cd5700 (LWP 18785) "prun" 0x00007fb7c36886d3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
  3    Thread 0x7fb7bfa3a700 (LWP 18786) "prun" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  2    Thread 0x7fb7bf239700 (LWP 18787) "prun" 0x00007fb7c367f913 in select () at ../sysdeps/unix/syscall-template.S:81
* 1    Thread 0x7fb7c4834740 (LWP 18784) "prun" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fb7c438e618 in PMIx_tool_finalize () at tool/pmix_tool.c:1143
#2  0x0000000000408b29 in prun (argc=2, argv=0x7ffcbbcbfd68) at prun.c:1229
#3  0x0000000000402e6d in main (argc=2, argv=0x7ffcbbcbfd68) at main.c:13
(gdb) p pmix_globals.connected
$1 = false
(gdb) f 1
#1  0x00007fb7c438e618 in PMIx_tool_finalize () at tool/pmix_tool.c:1143
1143	        PMIX_WAIT_THREAD(&tev.lock);
(gdb) l
1138	            }
1139	            return rc;
1140	        }
1141	
1142	        /* wait for the ack to return */
1143	        PMIX_WAIT_THREAD(&tev.lock);
1144	        PMIX_DESTRUCT_LOCK(&tev.lock);
1145	        if (tev.active) {
1146	            pmix_event_del(&tev.ev);
1147	        }
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fb7bfa3a700 (LWP 18786))]
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135	../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fb7c395fdb0 in pthread_cond_broadcast@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S:136
#2  0x0000000000403f86 in evhandler (evhdlr_registration_id=0, status=-101, source=0x7fb7b800131c, info=0x7fb7b80017d0, ninfo=1, results=0x0, nresults=0, 
    cbfunc=0x7fb7c430aee8 <progress_local_event_hdlr>, cbdata=0x7fb7b8001240) at prun.c:346
#3  0x00007fb7c430d21e in pmix_invoke_local_event_hdlr (chain=0x7fb7b8001240) at event/pmix_event_notification.c:738
#4  0x00007fb7c430fcb6 in pmix_event_timeout_cb (fd=-1, flags=1, arg=0x7fb7b8001240) at event/pmix_event_notification.c:1143
#5  0x00007fb7c3b7ef24 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
#6  0x00007fb7c4384d85 in progress_engine (obj=0x1584cd8) at runtime/pmix_progress_threads.c:109
#7  0x00007fb7c395b184 in start_thread (arg=0x7fb7bfa3a700) at pthread_create.c:312
#8  0x00007fb7c368803d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) f 2
#2  0x0000000000403f86 in evhandler (evhdlr_registration_id=0, status=-101, source=0x7fb7b800131c, info=0x7fb7b80017d0, ninfo=1, results=0x0, nresults=0, 
    cbfunc=0x7fb7c430aee8 <progress_local_event_hdlr>, cbdata=0x7fb7b8001240) at prun.c:346
346	    OPAL_PMIX_WAKEUP_THREAD(lock);
(gdb) l
341	    lock->status = jobstatus;
342	    if (NULL != msg) {
343	        lock->msg = strdup(msg);
344	    }
345	    /* release the lock */
346	    OPAL_PMIX_WAKEUP_THREAD(lock);
347	
348	    /* we _always_ have to execute the evhandler callback or
349	     * else the event progress engine will hang */
350	    if (NULL != cbfunc) {

I think the race occurs when PMIx_tool_finalize() is invoked while pmix_globals.connected is true but becomes false in the middle of it.
The surprising thing is that prun is notified about it (since evhandler() is invoked with status=PMIX_ERR_LOST_CONNECTION_TO_SERVER) and does not take any action, which is another story.
The odd thing is that PMIX_PTL_SEND_RECV did not invoke the finwait_cbfunc() callback at all.

All that being said, should we really care?
I mean that in the case of prun --terminate, could we simply exit(0) after the PMIx_Job_control_fn() callback is invoked and not call PMIx_tool_finalize() at all?
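
A rough sketch of that alternative, assuming prun only needs the job-control acknowledgement before exiting (a simplified busy-wait stands in for prun's real lock machinery, and the terminate targets/attributes are illustrative, not prun's actual code):

#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>
#include <pmix.h>
#include <pmix_tool.h>

static volatile bool term_acked = false;

/* Fired when the server acknowledges the terminate request. */
static void jctrl_cb(pmix_status_t status, pmix_info_t *info, size_t ninfo,
                     void *cbdata, pmix_release_cbfunc_t release_fn,
                     void *release_cbdata)
{
    if (NULL != release_fn) {
        release_fn(release_cbdata);
    }
    term_acked = true;
}

int main(void)
{
    pmix_proc_t me;
    pmix_info_t directive;
    bool flag = true;

    if (PMIX_SUCCESS != PMIx_tool_init(&me, NULL, 0)) {
        return 1;
    }

    /* Ask the DVM to terminate, then exit once the ack arrives --
     * deliberately skipping PMIx_tool_finalize(), per the suggestion above. */
    PMIX_INFO_CONSTRUCT(&directive);
    PMIX_INFO_LOAD(&directive, PMIX_JOB_CTRL_TERMINATE, &flag, PMIX_BOOL);
    if (PMIX_SUCCESS != PMIx_Job_control_nb(NULL, 0, &directive, 1,
                                            jctrl_cb, NULL)) {
        exit(1);
    }
    while (!term_acked) {
        usleep(1000);
    }
    exit(0);
}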

Installation error: cannot stat 'prrte-default-hostfile'

Thank you for taking the time to submit an issue!

Background information

Install fails due to missing file

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

github master @ 2d99acb

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

github master @ d2473b0e641709bb8395823d07d39726543486ae

Please describe the system on which you are running

  • Operating system/version: Fedora 30
  • Computer hardware: x86_64
  • Network type: eth (mainly shared memory)

Details of the problem

$ make install
...
...
...
make[3]: Entering directory '/home/arcurtis/src/prrte/build/src/etc'
make[3]: Nothing to be done for 'install-exec-am'.
/usr/bin/mkdir -p /home/arcurtis/opt/prrte/git/etc
******************************* WARNING ************************************
*** Not installing new prrte-mca-params.conf over existing file in:
***   /home/arcurtis/opt/prrte/git/etc/prrte-mca-params.conf
******************************* WARNING ************************************
/usr/bin/install: cannot stat 'prrte-default-hostfile': No such file or directory
make[3]: *** [Makefile:866: install-data-local] Error 1
make[3]: Leaving directory '/home/arcurtis/src/prrte/build/src/etc'
make[2]: *** [Makefile:749: install-am] Error 2
make[2]: Leaving directory '/home/arcurtis/src/prrte/build/src/etc'
make[1]: *** [Makefile:1862: install-recursive] Error 1
make[1]: Leaving directory '/home/arcurtis/src/prrte/build/src'
make: *** [Makefile:843: install-recursive] Error 1

ORTE_ERROR_LOG: Data unpack would read past end of buffer

Thank you for taking the time to submit an issue!

Background information

Upgraded PMIx and PRRTE; code that was working now crashes.

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

git master @ e37bfeb

Is the reference server using its internal version of PMIx, or an external one?

ext

If external, what version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

git master @ aeb383ba1ecb00515de85450abbb9e1d8e113dd8

Describe how PMIx was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

Please describe the system on which you are running

  • Operating system/version: CentOS 7
  • Computer hardware: Penguin, Xeon
  • Network type: infiniband (mlx4, mlx5)

Details of the problem

During startup of my OpenSHMEM library I exchange various bits of info, e.g., symmetric heap addresses/sizes. I think I am now seeing problems with that when using PRRTE. Launching via Open MPI's mpirun still works. The errors are:

[cn092:134990] [[46042,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../psrvr-git/orte/orted/pmix/pmix_server_pub.c at line 583
[cn092:134990] [[46042,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../psrvr-git/orte/orted/pmix/pmix_server_pub.c at line 583
[cn092:134990] [[46042,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../psrvr-git/orte/orted/pmix/pmix_server_pub.c at line 583
[cn092:134990] [[46042,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../psrvr-git/orte/orted/pmix/pmix_server_pub.c at line 583
[cn092:134990] [[46042,0],1] errmgr:default_orted:proc_errors process [[46042,0],0] error state LIFELINE LOST
[cn092:134990] [[46042,0],1] errmgr:orted lifeline lost or unable to communicate - exiting

Another oddity is that this program is running on 2 nodes, 2 cores per node, but I am only seeing one host here for all 4 ranks.
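
The error path is in the server's publish/lookup handling; for context, here is a minimal sketch of the kind of exchange an OpenSHMEM startup might perform. The key name and value are illustrative and this is not the reporter's actual library code:

#include <stdio.h>
#include <string.h>
#include <pmix.h>

/* Illustrative publish/lookup exchange, similar in spirit to exchanging
 * symmetric heap info at startup. Not the reporter's actual code. */
int main(void)
{
    pmix_proc_t me;
    pmix_info_t info;
    pmix_pdata_t pdata;
    uint64_t heap_base = 0x1000;   /* stand-in for a real heap address */
    char key[PMIX_MAX_KEYLEN + 1];

    if (PMIX_SUCCESS != PMIx_Init(&me, NULL, 0)) {
        return 1;
    }

    /* Publish one key/value pair under a rank-specific key. */
    snprintf(key, sizeof(key), "heap-%u", me.rank);
    PMIX_INFO_CONSTRUCT(&info);
    PMIX_INFO_LOAD(&info, key, &heap_base, PMIX_UINT64);
    if (PMIX_SUCCESS != PMIx_Publish(&info, 1)) {
        fprintf(stderr, "publish failed\n");
    }
    PMIX_INFO_DESTRUCT(&info);

    /* Look the value back up (a real library would look up peers' keys). */
    PMIX_PDATA_CONSTRUCT(&pdata);
    (void)strncpy(pdata.key, key, PMIX_MAX_KEYLEN);
    if (PMIX_SUCCESS == PMIx_Lookup(&pdata, 1, NULL, 0)) {
        printf("%u: looked up %s = %lu\n", me.rank, pdata.key,
               (unsigned long)pdata.value.data.uint64);
    }
    PMIX_PDATA_DESTRUCT(&pdata);

    PMIx_Finalize(NULL, 0);
    return 0;
}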

Read failure message on job termination

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

PRRTE master at a9ef1f5

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

PMIx master at openpmix/openpmix@a3cfa97

Please describe the system on which you are running

  • Operating system/version: RHEL 7.6
  • Computer hardware: Power8
  • Network type: TCP

Details of the problem

On job completion I sometimes see a message like the one below repeated over and over again until the job is killed:

[node01:33166] Read -1, expected 1048576, errno = 14
[node01:33166] Read -1, expected 1048576, errno = 14
[node01:33166] Read -1, expected 1048576, errno = 14

I suspect that this is a race between the prte daemon closing the channel to the prun process and the prun process deregistering it from libevent.

It is a difficult timing window to hit in the wild, but regression testing in Open MPI via MTT hits this nightly due to the large volume of tests being run.

stdin and PRRTE?

Thank you for taking the time to submit an issue!

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

git master @ 4c77d72

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

git master @ c82c6dca63036d06e75da3aff8df16165635a56c

Please describe the system on which you are running

  • Operating system/version: CentOS 7
  • Computer hardware: x86_64
  • Network type: IB

Details of the problem

Testing an OpenSHMEM program that reads a value from stdin on start-up (rank 0 only).

With Open MPI 3.1.3 acting as launcher/server with PMIx 2.1.1, all is well.

With the PRRTE/PMIx git combination, I get the prompt and enter a value, but then there is no progress. It also takes significantly longer for the prompt to appear.

Simple test program:

#include <stdio.h>
#include <shmem.h>

int
main()
{
    int n = -1;
    int me;

    shmem_init();
    me = shmem_my_pe();

    if (me == 0) {
        printf("Enter n : "); fflush(stdout);
        fscanf(stdin, "%d", &n);
        printf("You entered %d\n", n);
    }

    shmem_barrier_all();

    printf("PE %d: n = %d\n", me, n);

    return 0;
}

Expected result:

$ oshrun -n 2 ./a.out
oshrun:prrte: found "prun"
oshrun:prrte: check matching "prte"
oshrun:prrte: no "prte", skipping
oshrun:prrte: check matching "psrvr"
oshrun:prrte: no "psrvr", skipping
oshrun:launch: look for "mpiexec"
oshrun:launch: using "/gpfs/projects/ChapmanGroup/opt/openmpi/3.1.3/bin/mpiexec"
oshrun:launch: "mpiexec -n 2 ./a.out"
oshrun:----------------------------------------------------------------------
Enter n : 23
You entered 23
PE 0: n = 23
PE 1: n = -1
oshrun:launch: done

Bad result:

$ oshrun -n 2 ./a.out
oshrun:prrte: found "prun"
oshrun:prrte: check matching "prte"
oshrun:prrte: found "prte"
oshrun:prrte: starting up
oshrun:prrte: pid 31639 says "DVM ready"
oshrun:launch: "prun -n 2 ./a.out"
oshrun:launch: application in process 31741
oshrun:----------------------------------------------------------------------
Enter n : 23
<nothing more, hang>

possible bug in PRTE

Playing a bit with PRRTE, I ran into what I suspect might be a bug in prte. I compiled master with an external PMIx 3.0.2. I then ran a simple test on my laptop: I started prte -d in one terminal and used prun to run a client:

prun -np 1 pmix-3.0.2/examples/client
Client ns 2573926402 rank 0: Running
Client 2573926402:0 universe size 4
Client 2573926402:0 num procs 1
Client ns 2573926402 rank 0: Finalizing
Client ns 2573926402 rank 0:PMIx_Finalize successfully completed

So far so good. But then I Ctrl-C'd the client and didn't let it finish cleanly. After that, when I tried to prun again, nothing happened. That is, prun doesn't show any output. I still get debug messages on the prte console, so there is some activity going on between prun and prte, but the client is not executed.

I've noticed that, instead of aborting the client with a signal, I can break prte in the same way by running the alloc example:

$ ~/work/pmi/install-prrte/bin/prun -np 1 ./alloc
Client ns 2554920962 rank 0: Running
Client 2554920962:0 universe size 4
Allocation request returned PROC-ABORT-REQUESTED

After that, prun doesn't do anything anymore. So it seems that some behavior in the client can put the server into an unusable state.

prte_info segv on startup

Running prte_info segfaults on startup

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

git master @ 52d4988

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

git master @ 88500d4

Please describe the system on which you are running

Linux desktop


Details of the problem

shell$ prte_info
Segmentation fault
shell$ echo $?
139
(gdb) bt
#0  strlen () at ../sysdeps/x86_64/strlen.S:106
#1  0x00007ffff74d247e in __GI___strdup (s=0x0) at strdup.c:41
#2  0x00007ffff7ad0140 in prrte_mca_base_open ()
   from /home/3t4/projects/ompi-ecp/ompi-scratch/CREEPY-CAT/ompi/_install/lib/libprrte.so.2
#3  0x0000000000402143 in main ()
(gdb)

Rank-by processing is broken

The order of processing for the loops that span objects vs. procs on a node was reversed at some point. While the change is certainly faster (it put the larger loop on the outside of the smaller one), it unfortunately yields the wrong answer.
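
An illustrative toy sketch (not PRRTE's mapper code) of why the nesting order matters: both orders visit every (object, proc) pair, but "rank by object" needs the object loop innermost per round so consecutive ranks interleave across objects, and swapping the loops silently changes the assignment:

#include <stdio.h>

/* Toy model: 2 objects (e.g. sockets) on a node, 4 procs to rank.
 * "Rank by object" should interleave consecutive ranks across objects. */
#define NOBJS  2
#define NPROCS 4

int main(void)
{
    int rank;

    /* Correct ordering: outer loop over ranking rounds, inner over objects.
     * Produces rank->object assignment 0,1,0,1 (interleaved). */
    rank = 0;
    for (int round = 0; round < NPROCS / NOBJS; round++) {
        for (int obj = 0; obj < NOBJS; obj++) {
            printf("correct:  rank %d -> object %d\n", rank++, obj);
        }
    }

    /* Reversed ordering: fills each object before moving to the next.
     * Produces 0,0,1,1 -- a different (wrong) answer for rank-by-object. */
    rank = 0;
    for (int obj = 0; obj < NOBJS; obj++) {
        for (int round = 0; round < NPROCS / NOBJS; round++) {
            printf("reversed: rank %d -> object %d\n", rank++, obj);
        }
    }
    return 0;
}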

prte crashes immediately upon prun

Thank you for taking the time to submit an issue!

Background information

The latest update to PRRTE causes prte to crash.

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

git master @ 5c02d19

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

git master @ cf043ddce69cab9f874777bb2bef60e1ea9465d2

Please describe the system on which you are running

  • Operating system/version: CentOS 7
  • Computer hardware: x86_64
  • Network type: IB

Details of the problem

prte crashes as soon as a client contacts it.

$ prte
DVM ready

<wait for prun client>

[cn-mem:07946] *** Process received signal ***
[cn-mem:07946] Signal: Segmentation fault (11)
[cn-mem:07946] Signal code: Address not mapped (1)
[cn-mem:07946] Failing at address: (nil)
[cn-mem:07946] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2aaaacc05370]
[cn-mem:07946] [ 1] /lib64/libc.so.6(vsnprintf+0x62)[0x2aaaace86162]
[cn-mem:07946] [ 2] /lib64/libc.so.6(snprintf+0x82)[0x2aaaace63912]
[cn-mem:07946] [ 3] /gpfs/home/arcurtis/opt/prrte/git/lib/libprteopen-rte.so.0(orte_util_snprintf_jobid+0x19)[0x2aaaaacf7329]
[cn-mem:07946] [ 4] /gpfs/home/arcurtis/opt/prrte/git/lib/libprteopen-rte.so.0(orte_pmix_server_register_nspace+0x2c8a)[0x2aaaaad1553a]
[cn-mem:07946] [ 5] /gpfs/home/arcurtis/opt/prrte/git/lib/libprteopen-rte.so.0(orte_odls_base_default_construct_child_list+0x13ee)[0x2aaaaad3b35e]
[cn-mem:07946] [ 6] /gpfs/home/arcurtis/opt/prrte/git/lib/pmix/mca_odls_default.so(+0x253e)[0x2aaab3a1d53e]
[cn-mem:07946] [ 7] /gpfs/home/arcurtis/opt/prrte/git/lib/libprteopen-rte.so.0(orte_daemon_recv+0xabb)[0x2aaaaad06f6b]
[cn-mem:07946] [ 8] /gpfs/home/arcurtis/opt/prrte/git/lib/libprteopen-rte.so.0(orte_rml_base_process_msg+0x14b)[0x2aaaaad5e2eb]
[cn-mem:07946] [ 9] /gpfs/projects/ChapmanGroup/opt/libevent/lib/libevent-2.1.so.6(+0x2153d)[0x2aaaab90d53d]
[cn-mem:07946] [10] /gpfs/projects/ChapmanGroup/opt/libevent/lib/libevent-2.1.so.6(event_base_loop+0x3ef)[0x2aaaab90dc4f]
[cn-mem:07946] [11] prte[0x4028ad]
[cn-mem:07946] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaace33b35]
[cn-mem:07946] [13] prte[0x401b29]
[cn-mem:07946] *** End of error message ***
Segmentation fault (core dumped)

Test on local host: PMIX ERROR: NOT-FOUND in file tool/pmix_tool.c at line 250

Thank you for taking the time to submit an issue!

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

master@70b72b7ea9f1be771f962d1c3f205f3dab6bf529

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

master@089187ccb2ff2c10de09c8cc082ec76fadac897e

Please describe the system on which you are running

  • Operating system/version: Ubuntu Disco
  • Computer hardware: amd64
  • Network type: localhost test, no network really involved

Details of the problem

While trying to refresh my knowledge of PRRTE, I tried to run a simple test: on a single node, start prte and run a simple prun command to execute /bin/hostname.
Here are the commands I used:

$ $HOME/install/prrte_singularity/bin/prte --host localhost

In a different terminal:

$ $HOME/install/prrte_singularity/bin/prun  /bin/hostname 

which consistently gives me the following error:

[pessoa3:06117] PMIX ERROR: NOT-FOUND in file tool/pmix_tool.c at line 250

Am I doing something wrong?

RFC - PRRTE DVM testing

Thank you for taking the time to submit an issue!

Background information

We need to test the DVM features provided by PRRTE through the prte and prun commands (basically testing the run-a-job-in-a-job model). These tests need to be executed on OLCF platforms at ORNL.

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

master

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

master

Please describe the system on which you are running

  • Operating system/version:
  • Computer hardware:
  • Network type:

Details of the problem

Overview

The goal so far has been to provide all the mechanisms to test different use cases of PRRTE-DVM on OLCF systems. This implies a few constraints:

  • provide a tool that can easily support new use cases (more details further below) that reflect the needs of new users and are not already covered,
  • for this tool, prefer a programming language that is available on all target OLCF platforms,
  • make sure that the tests are pass/fail so they can later be integrated with MTT. Note that we are not at the moment looking at extending MTT to support this new execution model (i.e., "run a job in a job"), since it would imply more coordinated work with the MTT team for little benefit. We believe having a separate tool that could be used within MTT to test DVM capabilities is a cleaner and more efficient option (that point can be discussed, though).

Use cases

The goal is to test different use cases relying on the distributed virtual machine capabilities of PRRTE. These use cases are driven by application teams' feedback, at the moment mainly from ORNL and the RADICAL team (http://radical.rutgers.edu).

Supported use cases

Many nodes; use all local cores but no oversubscription; short-lived applications

The goal of this use case is to test scalability when using as many nodes as possible on a platform, while using all the computing resources (cores, at the moment) on the compute nodes. The test shall fail if all the nodes on the platform (or at least a target number of nodes) cannot be used to run a simple /bin/hostname on each node of the allocation.
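
For illustration, a minimal pass/fail harness sketch in C for this use case (the actual proof-of-concept described under "Current state" is written in Perl). It assumes prun is on the PATH, that --map-by node distributes one process per node, and that the expected node count is supplied by the caller:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Pass/fail check: launch one hostname per node through the DVM and verify
 * that the number of distinct hostnames matches the expected node count.
 * Usage: ./dvm_check <expected_node_count>    (prun assumed to be on PATH) */
int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <expected_node_count>\n", argv[0]);
        return 2;
    }
    int expected = atoi(argv[1]);

    char cmd[256];
    snprintf(cmd, sizeof(cmd), "prun --map-by node -n %d hostname", expected);
    FILE *out = popen(cmd, "r");
    if (NULL == out) {
        perror("popen");
        return 2;
    }

    /* Collect the distinct hostnames printed by the job. */
    char line[256], seen[1024][256];
    int nseen = 0;
    while (fgets(line, sizeof(line), out) != NULL) {
        line[strcspn(line, "\n")] = '\0';
        int dup = 0;
        for (int i = 0; i < nseen; i++) {
            if (0 == strcmp(seen[i], line)) { dup = 1; break; }
        }
        if (!dup && nseen < 1024) {
            strcpy(seen[nseen++], line);
        }
    }
    int status = pclose(out);   /* 0 means prun itself exited cleanly */

    if (0 == status && nseen == expected) {
        printf("PASS: %d distinct nodes\n", nseen);
        return 0;
    }
    printf("FAIL: expected %d nodes, saw %d (prun status %d)\n",
           expected, nseen, status);
    return 1;
}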

Future use cases

Many nodes; oversubscription; short-lived applications

The goal of this use case is to test the scalability of the DVM when using as many nodes as possible with oversubscription and short-lived applications. The workload will be predefined (many-task model) and the test will discover the upper limit in terms of the number of nodes that can run the test. The test will succeed if the upper limit is equal to or greater than the target number of nodes for a given platform. The idea behind this test is also to assume that users can submit a large number of sub-jobs and the DVM will throttle the sub-job execution to guarantee high throughput (we do not have any quantitative requirements regarding throughput at the moment).

Many nodes; no oversubscription; many applications with random run times

The goal of this use case is to test the scalability of the DVM when using as many nodes as possible with no oversubscription and applications that run for a random amount of time. The number of tasks will be predefined, but the total execution time required by the workload will be determined at runtime. The goal of this test is to evaluate the robustness of the infrastructure when running different types of applications.

Resource Manager Interaction

Because of the environment at our center, integration with job/resource managers is mandatory (we cannot rely only on tests that require interactive sessions). This implies the need for an architecture where various resource/job managers can easily be added.

Current state

A simple proof-of-concept has been developed and used for evaluation on OLCF systems. The current version has been developed in Perl, the only programming language available on all target platforms when the project started. The programming language choice could be reconsidered at this time.

Development is based on an incremental approach, meaning that only the first supported use case is currently implemented. Testing is at the moment focused on the Summitdev system at ORNL; once we are able to pass our first test on the entire system, that test will be executed at larger scale on Summit, while other use cases will be implemented and tested on Summitdev.

prun: propagate app exit code to caller?

Thank you for taking the time to submit an issue!

Background information

Trying to trap exit codes from an app launched via prun (and prte).

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

git master @093174a070fbcf76bb00df78c7d4f4b0ef3c4da8

Is the reference server using its internal version of PMIx, or an external one?

External

If external, what version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

git master @35ae8a0320c32023fb1a2ab087352d07e888bfc9

Describe how PMIx was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

source

Please describe the system on which you are running

  • Operating system/version: CentOS 7
  • Computer hardware: x86_64
  • Network type: IBV

Details of the problem

If the app calls exit(1) or equivalent, can I get this back from prun? prun always seems to exit(0).
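
As a point of reference while the behavior is sorted out, a caller can at least observe whatever status prun itself returns. A small sketch of such a wrapper follows; it is nothing PRRTE-specific, just standard fork/exec/waitpid, and until prun propagates the app's exit(1) it will keep reporting 0:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run "prun -n 1 <app> [args...]" and report the exit status prun hands back. */
int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <app> [args...]\n", argv[0]);
        return 2;
    }

    pid_t pid = fork();
    if (0 == pid) {
        /* Child: hand the app (and its args) to prun. */
        char **args = calloc(argc + 3, sizeof(char *));
        args[0] = "prun";
        args[1] = "-n";
        args[2] = "1";
        for (int i = 1; i < argc; i++) {
            args[i + 2] = argv[i];
        }
        execvp("prun", args);
        perror("execvp");
        _exit(127);
    }

    int status = 0;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status)) {
        printf("prun exited with %d\n", WEXITSTATUS(status));
        return WEXITSTATUS(status);
    }
    printf("prun terminated abnormally\n");
    return 1;
}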

PSRVR integration errors when using external PMIx v2.0

Looks like there have been a number of changes to the PSRVR core code that reflect PMIx v3 support, thereby causing problems when building against an external v2.x, including:

[rhc001:242687] [[545,1],0] ORTE_ERROR_LOG: Not found in file base/ess_base_std_tool.c at line 311
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  store HNP URI failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[rhc001:242687] [[545,1],0] ORTE_ERROR_LOG: Not found in file ess_tool_module.c at line 130

and

[rhc001:242323] ptl:tcp: connecting to server
[rhc001:242323] ptl:tcp:tool searching for session server pmix.rhc001.tool
[rhc001:242323] pmix:tcp: searching directory /tmp
[rhc001:242323] pmix:tcp: ignoring .XIM-unix
[rhc001:242323] pmix:tcp: ignoring .X11-unix
[rhc001:242323] pmix:tcp: ignoring .font-unix
[rhc001:242323] pmix:tcp: ignoring .ICE-unix
[rhc001:242323] pmix:tcp: ignoring .Test-unix
[rhc001:242323] pmix:tcp: ignoring systemd-private-aba1df39ec8b4bc8a77564371fae743e-httpd.service-DlvHDw
[rhc001:242323] pmix:tcp: ignoring am4t8CPKnG
[rhc001:242323] pmix:tcp: ignoring am4tjvHRDk
[rhc001:242323] pmix:tcp: ignoring pmix.sys.rhc001
[rhc001:242323] pmix:tcp: ignoring systemd-private-aba1df39ec8b4bc8a77564371fae743e-rtkit-daemon.service-jIWT6h
[rhc001:242323] pmix:tcp: ignoring systemd-private-aba1df39ec8b4bc8a77564371fae743e-mariadb.service-S4EoAs
[rhc001:242323] pmix:tcp: ignoring .X0-lock
[rhc001:242323] pmix:tcp: ignoring arvjQrOH
[rhc001:242323] pmix:tcp: ignoring yum_save_tx.2018-01-18.03-46.NgOPvZ.yumtx
[rhc001:242323] pmix:tcp: ignoring ompi.rhc001.1000
[rhc001:242323] pmix:tcp: ignoring hsperfdata_root
[rhc001:242323] pmix:tcp: ignoring systemd-private-aba1df39ec8b4bc8a77564371fae743e-chronyd.service-Xo0Sr2
[rhc001:242323] pmix:tcp: ignoring systemd-private-aba1df39ec8b4bc8a77564371fae743e-cups.service-KOOi8o
[rhc001:242323] pmix:tcp: ignoring yum_save_tx.2018-01-17.03-56.4NynkC.yumtx
[rhc001:242323] pmix:tcp: ignoring yum_save_tx.2018-01-19.08-28.XaRItc.yumtx
[rhc001:242323] pmix:tcp: ignoring systemd-private-aba1df39ec8b4bc8a77564371fae743e-colord.service-cx3Kep
[rhc001:242323] OPAL ERROR: Unreachable in file ext2x_client.c at line 240
[rhc001:242323] [[840,0],0] ORTE_ERROR_LOG: Unreachable in file base/ess_base_std_tool.c at line 192
--------------------------------------------------------------------------

and finally, daemon wireup support is busted:

[rhc001:242103] [[108,0],0] ORTE_ERROR_LOG: Not found in file state_dvm.c at line 300

configure - Wrong error message

Thank you for taking the time to submit an issue!

Background information

When compiling PRRTE, if I forget to specify the location where I installed the PMIx library, the configure script tells me to use the --with-external-pmix option, but the correct option is, as far as I can tell, --with-pmix, not --with-external-pmix.

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

3.0.0rc1

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

3.0.2

Please describe the system on which you are running

  • Operating system/version:
$ lsb_release -a
LSB Version:	:core-4.1-noarch:core-4.1-ppc64le
Distributor ID:	RedHatEnterpriseServer
Description:	Red Hat Enterprise Linux Server release 7.5 (Maipo)
Release:	7.5
Codename:	Maipo
  • Computer hardware: IBM Power 8
  • Network type: Mellanox IB

Details of the problem

When compiling PRRTE, if I forget to specify the location where I installed the PMIx library, I get the following error message:

============================================================================
== Configure PMIx
============================================================================
checking --with-external-pmix value... not found
configure: WARNING: Expected file /usr/include/pmix.h not found
configure: error: Cannot continue

However, the correct option is, as far as I can tell, --with-pmix and not --with-external-pmix. When using --with-pmix, everything is fine.

IOF fails to forward after first job is executed

The first time you invoke a job that spawns another job, the IO from the child job is properly forwarded and output by the parent. However, if you invoke the job again, the IO from the child job is lost. It appears that something in the IOF gets confused and is left in a "do not forward" state.

Spawn isn't respecting --oversubscribe

I'm launching this in my MTT (in a SLURM allocation with 2 servers, each with
16 cores) -- note the use of --oversubscribe here:

-----
mpirun  --oversubscribe --bind-to none -np 32 --mca orte_startup_timeout 10000
--mca oob tcp --mca btl tcp,self --mca mpi_leave_pinned_pipeline 1
src/mpi2c++_dynamics_test
-----

And I'm getting this:

-----
MPI-2 C++ bindings MPI-2 dynamics test suite
------------------------------
Open MPI Version 2.0

*** There are delays built into some of the tests
*** Please let them complete
*** No test should take more than 10 seconds

Test suite running with 32 processes
* MPI-2 Dynamics...                               
 - Looking for "connect" program...              PASS
 - MPI::Get_version...                           PASS
 - MPI::Open_port...                             PASS
 - MPI::Intercomm::Spawn...                     
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 1 slots
that were requested by the application:
 src/connect

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[mpi015:28434] *** An
error occurred in MPI_Comm_spawn
[mpi015:28434] *** reported by process [549847041,0]
[mpi015:28434] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[mpi015:28434] *** MPI_ERR_SPAWN: could not spawn processes
[mpi015:28434] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[mpi015:28434] ***    and potentially your MPI job)
-----

Yes, I'm running with -np 32 on 32 slots, but I said --oversubscribe.  So why did it fail?
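
For reference, a minimal standalone reproducer sketch of the failing operation, distinct from the test suite's actual code; the "./connect" child program path is a placeholder:

#include <stdio.h>
#include <mpi.h>

/* Spawn one extra process; with --oversubscribe this should succeed even
 * when all slots are already occupied. Illustrative placeholder child path. */
int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    int errcode = MPI_SUCCESS;

    MPI_Init(&argc, &argv);
    MPI_Comm_spawn("./connect", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, &errcode);
    if (MPI_SUCCESS == errcode) {
        printf("spawn succeeded\n");
        MPI_Comm_disconnect(&intercomm);
    } else {
        printf("spawn failed with error code %d\n", errcode);
    }
    MPI_Finalize();
    return 0;
}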

Problem in IOF subsystem

@ggouaillardet

I've tracked the instability problem down to an issue in the IOF, specifically when forwarding IO to the prun tool. At some point we wind up hitting a situation where the write fails (reason unclear) and the msg is then released, only to segfault due to a bad pointer:

Execution 20
[Ralphs-iMac-2.local:18679] SENDONEWAY server/pmix_server.c:1677:2
[Ralphs-iMac-2.local:18679] pmix_ptl_base: send_msg: write failed: Invalid argument (22) [sd = 14]
prte(18679,0x70000c640000) malloc: *** error for object 0x7fc5ed00f348: pointer being freed was not allocated
prte(18679,0x70000c640000) malloc: *** set a breakpoint in malloc_error_break to debug
[Ralphs-iMac-2.local:19175] PMIX ERROR: UNPACK-PAST-END in file event/pmix_event_registration.c at line 99

The last error comes from prun because it received only a partial payload.

It is unclear to me why the msg object is getting trashed. Here is what I see from gdb of the resulting prte core file:

(gdb) thread apply all where

Thread 5 (core thread 4):
#0  0x00007fff7c6655aa in select$DARWIN_EXTSN ()
#1  0x00000001029432ab in listen_thread (obj=0x102951850) at oob_tcp_listener.c:705
#2  0x00007fff7c718305 in _pthread_body ()
#3  0x00007fff7c71b26f in _pthread_start ()
#4  0x00007fff7c717415 in thread_start ()

Thread 4 (core thread 3):
#0  0x00007fff7c664716 in kevent ()
#1  0x00000001005a70c6 in kq_dispatch (base=0x7fc5eaf0c380, tv=<value temporarily unavailable, due to optimizations>) at kqueue.c:302
#2  0x00007000000f421a in ?? ()

Thread 3 (core thread 2):
#0  0x00007fff7c6655aa in select$DARWIN_EXTSN ()
#1  0x00000001004f2713 in listen_thread (obj=0x0) at base/ptl_base_listener.c:214
#2  0x00007fff7c718305 in _pthread_body ()
#3  0x00007fff7c71b26f in _pthread_start ()
#4  0x00007fff7c717415 in thread_start ()

Thread 2 (core thread 1):
#0  0x00007fff7c66423e in __pthread_kill ()
#1  0x00007fff7c71ac1c in pthread_kill ()
#2  0x00007fff7c5cd268 in __abort ()
#3  0x00007fff7c5cd1d8 in abort ()
#4  0x00007fff7c6dc6e2 in malloc_vreport ()
#5  0x00007fff7c6dc4a3 in malloc_report ()
#6  0x00000001004efd63 in pmix_ptl_base_send_handler (sd=14, flags=4, cbdata=0x7fc5ed00f1c0) at base/ptl_base_sendrecv.c:438
#7  0x000000010059e0e8 in event_process_active_single_queue (base=0x7fc5eaf005b0, activeq=0x7fc5eaf00880, max_to_process=2147483647, endtime=0x0) at event.c:1580
Previous frame inner to this frame (gdb could not unwind past this frame)

Thread 1 (core thread 0):
#0  0x00007fff7c6617de in __psynch_cvwait ()
#1  0x00007fff7c71b593 in _pthread_cond_wait ()
#2  0x00000001002af994 in orte_state_base_track_procs (fd=-1, argc=4, cbdata=0x7fc5ed0333b0) at base/state_base_fns.c:732
#3  0x000000010059de75 in event_process_active_single_queue (base=0x7fc5eac14860, activeq=0x7fc5eac14bf0, max_to_process=2147483647, endtime=0x0) at event.c:1646
Previous frame inner to this frame (gdb could not unwind past this frame)
(gdb) thread 2

[Switching to thread 2 (core thread 1)]
0x00007fff7c66423e in __pthread_kill ()
(gdb) up
#1  0x00007fff7c71ac1c in pthread_kill ()
(gdb) up
#2  0x00007fff7c5cd268 in __abort ()
(gdb) up
#3  0x00007fff7c5cd1d8 in abort ()
(gdb) up
#4  0x00007fff7c6dc6e2 in malloc_vreport ()
(gdb) up
#5  0x00007fff7c6dc4a3 in malloc_report ()
(gdb) up
#6  0x00000001004efd63 in pmix_ptl_base_send_handler (sd=14, flags=4, cbdata=0x7fc5ed00f1c0) at base/ptl_base_sendrecv.c:438
438	            PMIX_RELEASE(msg);
(gdb) print msg
$1 = (pmix_ptl_send_t *) 0x7fc5ed00f348
(gdb) print *msg
$2 = {
  super = {
    super = {
      obj_magic_id = 0, 
      obj_class = 0x100526940, 
      obj_reference_count = 0, 
      cls_init_file_name = 0x1005071e7 "base/ptl_base_sendrecv.c", 
      cls_init_lineno = 438
    }, 
    pmix_list_next = 0x0, 
    pmix_list_prev = 0x0, 
    item_free = 1, 
    pmix_list_item_refcount = 0, 
    pmix_list_item_belong_to = 0x0
  }, 
  ev = {
    ev_evcallback = {
      evcb_active_next = {
        tqe_next = 0xd8, 
        tqe_prev = 0x7fc5ed00f348
      }, 
      evcb_flags = 0, 
      evcb_pri = 0 '\0', 
      evcb_closure = 0 '\0', 
      evcb_cb_union = {
        evcb_callback = 0x7fc500000000, 
        evcb_selfcb = 0x7fc500000000, 
        evcb_evfinalize = 0x7fc500000000, 
        evcb_cbfinalize = 0x7fc500000000
      }, 
      evcb_arg = 0x14000003e8
    }, 
    ev_timeout_pos = {
      ev_next_with_common_timeout = {
        tqe_next = 0xdeafbeeddeafbeed, 
        tqe_prev = 0x100526980
      }, 
      min_heap_idx = -558907667
    }, 
    ev_fd = 1, 
    ev_base = 0x1004f84ae, 
    ev_ = {
      ev_io = {
        ev_io_next = {
          le_next = 0x7fc5000000ba, 
          le_prev = 0xdeafbeeddeafbeed
        }, 
        ev_timeout = {
          tv_sec = 4300368192, 
          tv_usec = 1
        }
      }, 
      ev_signal = {
        ev_signal_next = {
          le_next = 0x7fc5000000ba, 
          le_prev = 0xdeafbeeddeafbeed
        }, 
        ev_ncalls = 26944, 
        ev_pncalls = 0x7fc500000001
      }
    }, 
    ev_events = 28067, 
    ev_res = 79, 
    ev_timeout = {
      tv_sec = 140484085284953, 
      tv_usec = -318704672
    }
  }, 
  hdr = {
    pindex = -318704672, 
    tag = 32709, 
    nbytes = 4294967297
  }, 
  data = 0x7fc5ed00f3b8, 
  hdr_sent = false, 
  sdptr = 0xdeafbeeddeafbeed <Address 0xdeafbeeddeafbeed out of bounds>, 
  sdbytes = 4300368256
}
(gdb) print msg->data
$3 = (pmix_buffer_t *) 0x7fc5ed00f3b8
(gdb) print *msg->data
$4 = {
  parent = {
    obj_magic_id = 16046253926196952813, 
    obj_class = 0x100526980, 
    obj_reference_count = 1, 
    cls_init_file_name = 0x1004f84ae "include/pmix_globals.c", 
    cls_init_lineno = 186
  }, 
  type = 237 '?', 
  base_ptr = 0x100526940 "?mO", 
  pack_ptr = 0x7fc500000001 <Address 0x7fc500000001 out of bounds>, 
  unpack_ptr = 0x1004f6da3 "class/pmix_list.c", 
  bytes_allocated = 140484085284953, 
  bytes_used = 140488061547488
}

Can you take a look when you return? I'm guessing that we were only able to do a partial send, and that messed up the pointer to msg such that the subsequent attempt to complete the send errors out. Since the msg pointer has been messed up, we hit the "malloc free" error and abort.

non friendly error message when a job cannot be spawned

With latest prte and pmix from their master branches.

I incorrectly passed the -c flag to pcc and hence generated an object file when I really meant to generate a binary. Then I tried to prun it and ended up with an incomprehensible error message:

$ ~/local/prrte/bin/pcc -c -g -O0 -o hello hello.c 
$ ~/local/prrte/bin/prun -n 1 ./hello
[c7:31224] Job failed to spawn: ERROR STRING NOT FOUND

The bash error is Permission denied (since the executable bit is not set on an object file), and I would expect a similar error to be reported by prun.
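
An illustration (not PRRTE code) of the kind of pre-exec check that would produce the friendlier message the reporter expects:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>

/* Check whether a path is executable before trying to launch it, and report
 * the underlying errno text (e.g. "Permission denied") instead of a generic
 * "failed to spawn" message. Illustrative only. */
int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <executable>\n", argv[0]);
        return 2;
    }
    if (0 != access(argv[1], X_OK)) {
        fprintf(stderr, "cannot execute %s: %s\n", argv[1], strerror(errno));
        return 1;
    }
    printf("%s looks executable\n", argv[1]);
    return 0;
}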

unexpected prun behavior/errors

Thank you for taking the time to submit an issue!

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

Using master branch of prrte (a187840)

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

Using master branch of PMIx (1ca482d)

Please describe the system on which you are running

Happening on multiple systems, including Cori (Cray system @ NERSC) and Cooley (Linux cluster @ ALCF).

  • Operating system/version:
  • Computer hardware:
  • Network type:

Details of the problem

I am trying to roll my own PMIx environment for testing on Cori and am running into issues trying to launch jobs across multiple nodes (2 nodes for now). I am able to invoke prte across both nodes without error (I think, since "DVM Ready" is printed and there are no visible errors).

However, the behavior I get when launching jobs is not as expected. Just trying to run something simple like hostname to verify my processes are being distributed as I would like (round-robin across nodes) is yielding the following:

ssnyder@nid00098:~/software/ssg/build> ~/software/pmix/prrte/install/bin/prun -n 4 hostname
nid00098
nid00098
nid00098
nid00098

So, all 4 processes are being launched on a single node. Maybe that's expected behavior for prun in the absence of other command-line arguments. It looks like there are multiple ways to request this placement more explicitly, so I tried the following:

ssnyder@nid00098:~/software/ssg/build> ~/software/pmix/prrte/install/bin/prun --map-by node -n 4 hostname
nid00098
nid00098
prun: /global/homes/s/ssnyder/software/pmix/pmix/src/class/pmix_list.h:564: _pmix_list_append: Assertion `0 == item->pmix_list_item_refcount' failed.
[nid00098:47313] *** Process received signal ***
[nid00098:47313] Signal: Aborted (6)
[nid00098:47313] Signal code:  (-6)
[nid00098:47313] [ 0] /lib64/libpthread.so.0(+0x12360)[0x2aaacc842360]
[nid00098:47313] [ 1] /lib64/libc.so.6(gsignal+0x110)[0x2aaacca84160]
[nid00098:47313] [ 2] /lib64/libc.so.6(abort+0x151)[0x2aaacca85741]
[nid00098:47313] [ 3] /lib64/libc.so.6(+0x2e75a)[0x2aaacca7c75a]
[nid00098:47313] [ 4] /lib64/libc.so.6(+0x2e7d2)[0x2aaacca7c7d2]
[nid00098:47313] [ 5] /global/homes/s/ssnyder/software/pmix/pmix/install/lib/pmix/mca_gds_hash.so(+0x2fa6)[0x2aaad08bffa6]
[nid00098:47313] [ 6] /global/homes/s/ssnyder/software/pmix/pmix/install/lib/pmix/mca_gds_hash.so(+0x614d)[0x2aaad08c314d]
[nid00098:47313] [ 7] /global/homes/s/ssnyder/software/pmix/pmix/install/lib/pmix/mca_gds_hash.so(+0x1203b)[0x2aaad08cf03b]
[nid00098:47313] [ 8] /global/homes/s/ssnyder/software/pmix/pmix/install/lib/libpmix.so.0(+0x68713)[0x2aaaab2e2713]
[nid00098:47313] [ 9] /global/homes/s/ssnyder/software/pmix/pmix/install/lib/libpmix.so.0(pmix_ptl_base_process_msg+0x35f)[0x2aaaab3b6001]
[nid00098:47313] [10] /global/u2/s/ssnyder/software/spack/opt/spack/cray-cnl7-haswell/gcc-8.3.0/libevent-2.1.8-a2ij5ml7twhl6oxmxtesm2fkjoafjaz5/lib/libevent-2.1.so.6[0x20023a15]
[nid00098:47313] [11] /global/u2/s/ssnyder/software/spack/opt/spack/cray-cnl7-haswell/gcc-8.3.0/libevent-2.1.8-a2ij5ml7twhl6oxmxtesm2fkjoafjaz5/lib/libevent-2.1.so.6(event_base_loop+0x51f)[0x200243ef]
[nid00098:47313] [12] /global/homes/s/ssnyder/software/pmix/pmix/install/lib/libpmix.so.0(+0xc2f3e)[0x2aaaab33cf3e]
[nid00098:47313] [13] /lib64/libpthread.so.0(+0x7569)[0x2aaacc837569]
[nid00098:47313] [14] /lib64/libc.so.6(clone+0x3f)[0x2aaaccb46a2f]
[nid00098:47313] *** End of error message ***
Aborted

So for some reason, prun really didn't like that. It invokes 2 processes on my first node (nid00098) but never does so on the other node (nid00099). I suspected prte might simply not be running properly on the other node, despite the lack of launch errors, so I checked:

ssnyder@nid00098:~/software/ssg/build> ps aux | grep prte
ssnyder  48322  0.4  0.0 1418960 17468 pts/0   Sl   12:42   0:00 /global/homes/s/ssnyder/software/pmix/prrte/install/bin/prte -prefix /global/homes/s/ssnyder/software/pmix/prrte/install/
ssnyder  48327  0.2  0.0 393060 18552 pts/0    Sl   12:42   0:00 srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=nid00099 --ntasks=1 prted -pmca ess "slurm" -pmca ess_base_jobid "1267859456" -pmca ess_base_vpid "1" -pmca ess_base_num_procs "2" -pmca orte_hnp_uri "1267859456.0;tcp://10.128.0.99:37339"
ssnyder  48337  0.0  0.0 188172  2300 pts/0    S    12:42   0:00 srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=nid00099 --ntasks=1 prted -pmca ess "slurm" -pmca ess_base_jobid "1267859456" -pmca ess_base_vpid "1" -pmca ess_base_num_procs "2" -pmca orte_hnp_uri "1267859456.0;tcp://10.128.0.99:37339"

ssnyder@nid00098:~/software/ssg/build> ssh nid00099

ssnyder@nid00099:~> ps aux | grep prte
ssnyder  57297  0.2  0.0 1347144 17192 ?       Sl   12:42   0:00 /global/homes/s/ssnyder/software/pmix/prrte/install/bin/prted -pmca ess "slurm" -pmca ess_base_jobid "1267859456" -pmca ess_base_vpid "1" -pmca ess_base_num_procs "2" -pmca orte_hnp_uri "1267859456.0;tcp://10.128.0.99:37339"

Here's the corresponding bt for the case of the prun failure:

Program terminated with signal SIGABRT, Aborted.
#0  0x00002aaacca84160 in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x2aaad16ee700 (LWP 48889))]
(gdb) bt
#0  0x00002aaacca84160 in raise () from /lib64/libc.so.6
#1  0x00002aaacca85741 in abort () from /lib64/libc.so.6
#2  0x00002aaacca7c75a in __assert_fail_base () from /lib64/libc.so.6
#3  0x00002aaacca7c7d2 in __assert_fail () from /lib64/libc.so.6
#4  0x00002aaad08bffa6 in _pmix_list_append (list=0x2aaad4001bd0, item=0x100000bdfa0, 
    FILE_NAME=0x2aaad08d6790 "../../../../../src/mca/gds/hash/gds_hash.c", LINENO=383)
    at /global/homes/s/ssnyder/software/pmix/pmix/src/class/pmix_list.h:564
#5  0x00002aaad08c314d in process_node_array (val=0x2aaad4003220, tgt=0x2aaad4001bd0)
    at ../../../../../src/mca/gds/hash/gds_hash.c:383
#6  0x00002aaad08cf03b in hash_store_job_info (nspace=0x2aaad16eda40 "1233256450", 
    buf=0x2aaad16edcb0) at ../../../../../src/mca/gds/hash/gds_hash.c:1721
#7  0x00002aaaab2e2713 in wait_cbfunc (pr=0x100000a4440, hdr=0x100000bc044, buf=0x2aaad16edcb0, 
    cbdata=0x100000bd3c0) at ../../src/client/pmix_client_spawn.c:345
#8  0x00002aaaab3b6001 in pmix_ptl_base_process_msg (fd=-1, flags=4, cbdata=0x100000bbf70)
    at ../../../../src/mca/ptl/base/ptl_base_sendrecv.c:807
#9  0x0000000020023a15 in event_process_active_single_queue (base=base@entry=0x100000a3d10, 
    activeq=0x100000a4160, max_to_process=max_to_process@entry=2147483647, 
    endtime=endtime@entry=0x0) at event.c:1646
#10 0x00000000200243ef in event_process_active (base=0x100000a3d10) at event.c:1738
#11 event_base_loop (base=0x100000a3d10, flags=<optimized out>) at event.c:1961
#12 0x00002aaaab33cf3e in progress_engine (obj=0x100000a3c80)
    at ../../src/runtime/pmix_progress_threads.c:232
#13 0x00002aaacc837569 in start_thread () from /lib64/libpthread.so.0
#14 0x00002aaaccb46a2f in clone () from /lib64/libc.so.6

Any ideas on what could be happening? Are there any utilities I can run to sanity-check my server deployment (i.e., verify PMIx recognizes the 2 servers on which processes can be invoked)? Is there a convenient way to get more verbose logging/reporting to see if there are any hints about what the issue is? FWIW, I get identical behavior on another cluster (the Cooley system @ ALCF), but it uses rsh for the PLM since that system uses the Cobalt scheduler rather than Slurm (and thus I have to explicitly provide a node list). Maybe I'm just not setting something up properly?
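One low-cost way to get more insight (a hedged suggestion, not an authoritative recipe): the launch (plm), mapping (rmaps), and state frameworks accept the same style of verbosity parameters used elsewhere in these reports, and their output should show where prte believes the daemons were started and where it decided to place the procs. Something along these lines:

prte -pmca plm_base_verbose 10 -pmca rmaps_base_verbose 10 -pmca state_base_verbose 5 &
prun --map-by node -n 4 hostname
prun -terminate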

abnormal finalize: iof hnp finalize before all processes complete iof

Thank you for taking the time to submit an issue!

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

master @ ffe3dd3

Is the reference server using its internal version of PMIx, or an external one?

external

If external, what version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

master @ 3f81378fc76c12c6564c2fce2c69608a286a1707

Describe how PMIx was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone (with external libevent, pmix, enable-debug)

Please describe the system on which you are running

  • Operating system/version:Scientific Linux 7.4 (Nitrogen)
  • Computer hardware:x86-64
  • Network type: infiniband

Details of the problem

iof_hnp finalizes before all processes reach IOF COMPLETE; the read handlers of some processes have not been released when the HNP starts to finalize.

$prte -pmca pmix ext4x -pmca routed direct  -pmca pmix_base_verbose 2  -pmca iof_base_verbose 10 -pmca state_base_verbose 10  -debug-daemons

Running an MPI application compiled with mpicc. OMPI and PRRTE are using the same external PMIx and libevent.

I have included verbose output from iof and state; sorry for the volume of information, but I think it is helpful.

[saturn.icl.utk.edu:81156] [[6128,0],0] orted_cmd: received add_local_procs
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pulling fd 38 for process [[6128,14],0]
[saturn.icl.utk.edu:81156] defining endpt: file iof_hnp.c line 366 fd 38
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 32 for process [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],0]: iof_hnp.c 187
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 39 for process [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],0]: iof_hnp.c 190
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 41 for process [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],1]: iof_hnp.c 187
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 43 for process [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],1]: iof_hnp.c 190
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 45 for process [[6128,14],2]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],2]: iof_hnp.c 187
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 47 for process [[6128,14],2]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],2]: iof_hnp.c 190
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 49 for process [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],3]: iof_hnp.c 187
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 51 for process [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] defining read event for [[6128,14],3]: iof_hnp.c 190
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],0] STATE RUNNING AT base/odls_base_default_fns.c:1185
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],0] STATE RUNNING PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],1] STATE RUNNING AT base/odls_base_default_fns.c:1185
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],1] STATE RUNNING PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],2] STATE RUNNING AT base/odls_base_default_fns.c:1185
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],2] STATE RUNNING PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE RUNNING AT base/odls_base_default_fns.c:1185
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE RUNNING PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],0] state RUNNING
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],1] state RUNNING
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],2] state RUNNING
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state RUNNING
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE JOB [6128,14] STATE RUNNING AT base/state_base_fns.c:683
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING JOB [6128,14] STATE RUNNING PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp pushing fd 0 for process [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],0] STATE SYNC REGISTERED AT orted/pmix/pmix_server_gen.c:89
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],0] STATE SYNC REGISTERED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],0] state SYNC REGISTERED
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],1] STATE SYNC REGISTERED AT orted/pmix/pmix_server_gen.c:89
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],1] STATE SYNC REGISTERED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],1] state SYNC REGISTERED
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE SYNC REGISTERED AT orted/pmix/pmix_server_gen.c:89
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE SYNC REGISTERED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],2] STATE SYNC REGISTERED AT orted/pmix/pmix_server_gen.c:89
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],2] STATE SYNC REGISTERED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state SYNC REGISTERED
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],2] state SYNC REGISTERED
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE JOB [6128,14] STATE SYNC REGISTERED AT base/state_base_fns.c:693
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING JOB [6128,14] STATE SYNC REGISTERED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE JOB [6128,14] STATE READY FOR DEBUGGERS AT base/plm_base_launch_support.c:731
[saturn.icl.utk.edu:81156] ACTIVATE: ANY STATE NOT FOUND
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],2] of size 31 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 31 bytes from stdout of [[6128,14],2]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],0] of size 31 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 31 bytes from stdout of [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],1] of size 146 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 146 bytes from stdout of [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],3] of size 31 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 31 bytes from stdout of [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],0] of size 116 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 116 bytes from stdout of [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],3] of size 116 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 116 bytes from stdout of [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],2] of size 346 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 346 bytes from stdout of [[6128,14],2]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],1] of size 346 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 346 bytes from stdout of [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],0] of size 231 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 231 bytes from stdout of [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],3] of size 243 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 243 bytes from stdout of [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp:read handler [[6128,14],3] Error on connection:49
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],3] of size 0 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 0 bytes from stdout of [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],3] of size 0 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 0 bytes from stderr of [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE IOF COMPLETE AT iof_hnp_read.c:328
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE IOF COMPLETE PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state IOF COMPLETE
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE ABORTED BY SIGNAL AT base/odls_base_default_fns.c:1897
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE ABORTED BY SIGNAL PRI 0
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE WAITPID FIRED AT errmgr_default_hnp.c:647
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE WAITPID FIRED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state WAITPID FIRED
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE NORMALLY TERMINATED AT base/state_base_fns.c:715
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE NORMALLY TERMINATED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state NORMALLY TERMINATED
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE NORMALLY TERMINATED AT errmgr_default_hnp.c:206
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE NORMALLY TERMINATED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:cleanup_node on proc [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE ABORTED BY SIGNAL AT base/plm_base_receive.c:352
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE ABORTED BY SIGNAL PRI 0
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE WAITPID FIRED AT errmgr_default_hnp.c:647
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE WAITPID FIRED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state NORMALLY TERMINATED
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:cleanup_node on proc [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state WAITPID FIRED
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],3] STATE NORMALLY TERMINATED AT base/state_base_fns.c:715
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],3] STATE NORMALLY TERMINATED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],3] state NORMALLY TERMINATED
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:cleanup_node on proc [[6128,14],3]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],0] of size 97 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 97 bytes from stdout of [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],1] of size 106 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 106 bytes from stdout of [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],2] of size 106 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 106 bytes from stdout of [[6128,14],2]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],0] of size 97 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 97 bytes from stdout of [[6128,14],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp:read handler [[6128,14],1] Error on connection:41
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],1] of size 0 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 0 bytes from stdout of [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],1] of size 0 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 0 bytes from stderr of [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],1] STATE IOF COMPLETE AT iof_hnp_read.c:328
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],1] STATE IOF COMPLETE PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],1] state IOF COMPLETE
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],1] STATE WAITPID FIRED AT base/odls_base_default_fns.c:1897
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],1] STATE WAITPID FIRED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],1] state WAITPID FIRED
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE PROC [[6128,14],1] STATE NORMALLY TERMINATED AT base/state_base_fns.c:715
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING PROC [[6128,14],1] STATE NORMALLY TERMINATED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:track_procs called for proc [[6128,14],1] state NORMALLY TERMINATED
[saturn.icl.utk.edu:81156] [[6128,0],0] state:base:cleanup_node on proc [[6128,14],1]
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE JOB [6128,14] STATE NORMALLY TERMINATED AT base/state_base_fns.c:775
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING JOB [6128,14] STATE NORMALLY TERMINATED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] state:dvm:check_job_complete on job [6128,14]
[saturn.icl.utk.edu:81156] [[6128,0],0] state:dvm releasing procs from node saturn
[saturn.icl.utk.edu:81156] [[6128,0],0] state:dvm releasing proc [[6128,14],0] from node saturn
[saturn.icl.utk.edu:81156] [[6128,0],0] state:dvm releasing proc [[6128,14],2] from node saturn
[saturn.icl.utk.edu:81156] [[6128,0],0] state:dvm:check_job_completed state is terminated - activating notify
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATE JOB [6128,14] STATE NOTIFY COMPLETED AT state_dvm.c:588
[saturn.icl.utk.edu:81156] [[6128,0],0] ACTIVATING JOB [6128,14] STATE NOTIFY COMPLETED PRI 4
[saturn.icl.utk.edu:81156] [[6128,0],0] iof:hnp:read handler [[6128,14],2] Error on connection:45
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],2] of size 0 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 0 bytes from stdout of [[6128,14],2]
[saturn.icl.utk.edu:81156] [[6128,0],0] sending data from proc [[6128,14],2] of size 0 via PMIx to tool [[6128,13],0]
[saturn.icl.utk.edu:81156] [[6128,0],0] read 0 bytes from stderr of [[6128,14],2]
prte: base/iof_base_frame.c:195: orte_iof_base_proc_destruct: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (ptr->revstderr))->obj_magic_id' failed.

PROCs [[6128,14],1] and [[6128,14],3] have NORMALLY TERMINATED;
PROCs [[6128,14],2] and [[6128,14],0] are still doing IO forwarding when the job starts to terminate.
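For what it's worth, here is a sketch of the invariant the failed assertion seems to be protecting (hypothetical types, not the real orte_iof_proc_t): the per-proc IOF record should only be destructed once both of its read events have seen EOF and been released.

#include <stdbool.h>
#include <stddef.h>

typedef struct read_event read_event_t;   /* stand-in for a libevent read handler */

typedef struct {
    read_event_t *revstdout;   /* NULL once stdout hits EOF and the event is released */
    read_event_t *revstderr;   /* NULL once stderr hits EOF and the event is released */
} iof_proc_t;

/* "IOF COMPLETE" in the logs corresponds to both handlers having been released;
 * finalize/termination paths would need this to hold before destructing the
 * record, and here procs [[6128,14],0] and [[6128,14],2] do not yet satisfy it
 * when job termination is activated -- which matches the magic-id assertion. */
bool iof_complete(const iof_proc_t *p)
{
    return NULL == p->revstdout && NULL == p->revstderr;
}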

Latest git HEAD compilation error

Thank you for taking the time to submit an issue!

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

git master @ 4301061

Is the reference server using its internal version of PMIx, or an external one?

ext

If external, what version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

git master @ 3f81378fc76c12c6564c2fce2c69608a286a1707

Describe how PMIx was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

Please describe the system on which you are running

  • Operating system/version: CentOS 7
  • Computer hardware: x86_64
  • Network type: IBV

Details of the problem


make[2]: Entering directory `/gpfs/home/arcurtis/src/prrte/build/orte/tools/prun'
depbase=`echo prun.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I../../../../prrte-git/orte/tools/prun -I../../../opal/include   -I../../../../prrte-git -I../../.. -I../../../../prrte-git/opal/include -I../../../../prrte-git/orte/include -I../../../orte/include -I/gpfs/home/arcurtis/opt/pmix/git/include   -I/gpfs/projects/ChapmanGroup/opt/libevent/include -I/gpfs/home/arcurtis/opt/pmix/git/include  -I/gpfs/home/arcurtis/opt/hwloc/2.0.1/include  -DNDEBUG -ggdb -fno-strict-aliasing -mcx16 -pthread -g -MT prun.o -MD -MP -MF $depbase.Tpo -c -o prun.o ../../../../prrte-git/orte/tools/prun/prun.c &&\
mv -f $depbase.Tpo $depbase.Po
In file included from /gpfs/home/arcurtis/opt/pmix/git/include/pmix_common.h:2281:0,
                 from /gpfs/home/arcurtis/opt/pmix/git/include/pmix.h:52,
                 from ../../../../prrte-git/opal/pmix/pmix-internal.h:32,
                 from ../../../../prrte-git/orte/tools/prun/prun.c:61:
../../../../prrte-git/orte/tools/prun/prun.c: In function ‘prun’:
../../../../prrte-git/orte/tools/prun/prun.c:656:34: error: ‘PMIX_LAUNCHER_RENDEZVOUS_FILE’ undeclared (first use in this function)
         PMIX_INFO_LOAD(ds->info, PMIX_LAUNCHER_RENDEZVOUS_FILE, param, PMIX_STRING);
                                  ^
/gpfs/home/arcurtis/opt/pmix/git/include/pmix_extend.h:110:22: note: in definition of macro ‘PMIX_INFO_LOAD’
         if (NULL != (k)) {                                  \
                      ^
../../../../prrte-git/orte/tools/prun/prun.c:656:34: note: each undeclared identifier is reported only once for each function it appears in
         PMIX_INFO_LOAD(ds->info, PMIX_LAUNCHER_RENDEZVOUS_FILE, param, PMIX_STRING);
                                  ^
/gpfs/home/arcurtis/opt/pmix/git/include/pmix_extend.h:110:22: note: in definition of macro ‘PMIX_INFO_LOAD’
         if (NULL != (k)) {                                  \
                      ^
make[2]: *** [prun.o] Error 1
make[2]: Leaving directory `/gpfs/home/arcurtis/src/prrte/build/orte/tools/prun'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/gpfs/home/arcurtis/src/prrte/build/orte'
make: *** [all-recursive] Error 1

PRTE server crashes when multiple applications run back-to-back

Background information

With multiple applications running back-to-back, the PRTE server crashes at a random point. On a few occasions, the server just got stuck and the launched application did not terminate. I am testing this with the Sandia OpenSHMEM unit tests by running "make check" after the server is launched. The issue occurs much less frequently when the PRTE server is launched and terminated for each application separately. Detailed configuration and outputs are given below.

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

libevent 2.0.22-stable
hwloc 2.0.2
PMIx v3.1 (commit 7680895b0c5dec9b42206ddee35c80fb1683f6ca)
prte (PMIx Reference RTE) 3.0.0rc1

Please describe the system on which you are running

  • Operating system/version: CentOS Linux release 7.3.1611
  • Computer hardware: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
  • Network type: Intel(R) Omni-Path 100

Details of the problem

Here are the steps that lead to this crash, assuming Sandia OpenSHMEM is downloaded and configured in sandia-shmem-basedir:

prte &
[1] 7328
DVM Ready

cd sandia-shmem-basedir
make check

Below is the last part of the output collected while PRTE was run with the "-d" flag.

[node1:03550] [[18102,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_DVM_CLEANUP_JOB_CMD
[node1:03550] [[18102,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[node1:03550] [[18102,0],0] Releasing job data for [INVALID]
[node1:03550] [[18102,0],0] Releasing job data for [INVALID]
[node1:03550] sess_dir_finalize: proc session dir does not exist
[node1:03550] sess_dir_finalize: job session dir does not exist
[node1:03550] sess_dir_finalize: jobfam session dir not empty - leaving
[node1:03550] sess_dir_finalize: jobfam session dir not empty - leaving
[node1:03550] sess_dir_finalize: top session dir not empty - leaving
[node1:03550] sess_dir_finalize: proc session dir does not exist
[node1:03550] sess_dir_finalize: job session dir does not exist
[node1:03550] sess_dir_finalize: jobfam session dir not empty - leaving
[node1:03550] sess_dir_finalize: jobfam session dir not empty - leaving
[node1:03550] sess_dir_finalize: top session dir not empty - leaving
[node1:03550] [[18102,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_DVM_CLEANUP_JOB_CMD
[node1:03550] sess_dir_finalize: proc session dir does not exist
[node1:03550] sess_dir_finalize: job session dir does not exist
[node1:03550] sess_dir_finalize: jobfam session dir not empty - leaving
[node1:03550] sess_dir_finalize: jobfam session dir not empty - leaving
[node1:03550] sess_dir_finalize: top session dir not empty - leaving
[node1:03550] [[18102,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_DVM_CLEANUP_JOB_CMD
[node1:03550] [[18102,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[node1:03550] [[18102,0],0] Releasing job data for [INVALID]
[node1:03550] [[18102,0],0] Releasing job data for [INVALID]
[node1:03550] pmix_ptl_base: send_msg: write failed: Bad address (14) [sd = 25]
*** Error in `prte': munmap_chunk(): invalid pointer: 0x00007f24d0053a28 ***
======= Backtrace: =========
/usr/lib64/libc.so.6(+0x7ada4)[0x7f24ddbeada4]
/home/rahmanmd/prrte-install-trial/pmix-3.1/pmix-install/lib/libpmix.so.2(pmix_ptl_base_send_handler+0x3a5)[0x7f24df354a46]
/home/rahmanmd/prrte-install-trial/libevent-2.0.22-stable/libevent-install/lib/libevent-2.0.so.5(event_base_loop+0x812)[0x7f24dee2be82]
/home/rahmanmd/prrte-install-trial/pmix-3.1/pmix-install/lib/libpmix.so.2(+0x94931)[0x7f24df2fb931]
/usr/lib64/libpthread.so.0(+0x7dc5)[0x7f24ddf38dc5]
/usr/lib64/libc.so.6(clone+0x6d)[0x7f24ddc6773d]

prun race on failure of spawned job

@rhc54 @jjhursey and @jsquyres diagnosed a race condition when a job fails to launch.

E.g.:

$ prte --daemonize
$ prun some_executable_that_emits_stderr_and_fails_immediately

This may hang, and may or may not produce output.

It looks like prun is still stuck in the PMIx spawn API call. Looking at prte --mca state_base_verbose 5, the remote daemons reported the termination correctly, but there appears to be a race in which prun has not yet completed the PMIx spawn and therefore somehow misses the termination notification.

@jjhursey said he'd have a look.

Scalability problem

Thank you for taking the time to submit an issue!

Background information

Sorry if opening the ticket goes against the community rules; I am not quite sure I yet have enough information for a useful ticket.
I am trying to run scalability tests on various OLCF systems at ORNL to cover some of the needs of some of our users. My current test consists of starting N PEs on X nodes, where N is the number of cores available per compute node times the number of nodes; basically filling up the compute nodes and trying to find the upper limit on the number of nodes before we start to face problems. At the moment, I am trying to find the value of X where I start to face problems. For every test, I run hostname and, to validate the test, I count the number of host names in the output. I acknowledge this might not be the best test, but it captures the needs of a user; I am willing to run other tests to capture scalability problems. I can also share my test.

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

PRRTE master fd34cfa

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

PMIx master 30c51d72c74f0d225cd60aa8e4ce46054e44603d

Please describe the system on which you are running

  • Operating system/version:
$ lsb_release -a
LSB Version:	:core-4.1-noarch:core-4.1-ppc64le
Distributor ID:	RedHatEnterpriseServer
Description:	Red Hat Enterprise Linux Server release 7.5 (Maipo)
Release:	7.5
Codename:	Maipo
  • Computer hardware:
    IBM Power8
  • Network type:
    Mellanox ConnectX-4

Details of the problem

I am currently running my tests on Summitdev at ORNL.
My test runs the following loop, starting with 32 nodes and 20 PEs per node (one PE per core); a minimal sketch of the loop follows the list:

  • if the test fails, try with half the number of nodes
  • if the test succeeds, try with twice the number of nodes
  • the test stops when we fall back to a single node or when we are back at a number of nodes for which we already had a failure
  • all PEs run hostname
  • a test is considered as successful when the output has the same count of host names as the number of launched PEs.
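
As mentioned above, here is a minimal sketch of that loop, with hypothetical variable and file names (the real harness is the reporter's own script):

#!/bin/bash
PPN=20            # PEs per node on Summitdev (one per core)
nodes=32
failed=""
while true; do
    np=$((nodes * PPN))
    prun --prefix "$PRRTE_DIR" -np "$np" hostname > "out.${nodes}.txt"
    if [ "$(wc -l < "out.${nodes}.txt")" -eq "$np" ]; then
        next=$((nodes * 2))          # success: try twice as many nodes
    else
        failed="$failed $nodes"      # failure: remember this size and halve
        next=$((nodes / 2))
    fi
    # stop once we are down to a single node or about to revisit a failed size
    case " $failed " in *" $next "*) break ;; esac
    if [ "$next" -lt 1 ]; then break; fi
    nodes=$next
done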

On Summitdev, I get the following in a very consistent manner:

  • 32 nodes -> failure
  • 16 nodes -> failure
  • 8 nodes -> success
  • 16 nodes -> failure

For the last run with 16 nodes, I get the following error (I do not track error messages for all runs at the moment):

[summitdev-login1:11547] PMIX ERROR: OUT-OF-RESOURCE in file /ccs/home/gvh/scratch/summitdev/prrte/pmix/src/src/server/pmix_server.c at line 1785
User defined signal 2

Based on this, I suspect a problem in the mapper, since it should have all the required resources available.

The LSF script to start a job on 16 nodes looks like:

#!/bin/bash
# Begin LSF directives
#BSUB -P *****
#BSUB -J dvm_simple
#BSUB -o dvm_simple.out
#BSUB -e dvm_simple.err
#BSUB -W 00:10
#BSUB -nnodes 16
#BSUB -env "all"
# End LSF directives and begin shell commands

./get_list_hosts.pl

T="$(date +%s)"

echo "Starting DVM on 16 nodes..." >> ./dvm_simple_config.log
prte --prefix $PRRTE_DIR --report-uri prrteuri --hostfile ./DVM_HOSTS.txt &
echo "DVM started" >> ./dvm_simple_config.log

echo "Running job with 320 PEs..." >> ./dvm_simple_config.log
prun --prefix $PRRTE_DIR -np 320 hostname
echo "Job succeeded"  >> ./dvm_simple_config.log

echo "Sleeping for 30 seconds to give a chance to all messages to come back from the nodes..."
sleep 30

echo "Terminating DVM..." >> ./dvm_simple_config.log
prun --prefix $PRRTE_DIR -terminate
echo "DVM teminated" >> ./dvm_simple_config.log

T="$(($(date +%s)-T))"
echo "Total job runtime: $T seconds" >> ./dvm_simple_config.log

Note that I included a sleep 30 to give the system a chance to propagate back all the IO, since I believe there is no IO flush in PRRTE at the moment.
This uses a PRRTE module that I generate for the system; PRRTE_DIR points at the PRRTE install directory.

I will try to run the same test on Summit and Titan to see if I face the same limitations (these systems allow a different number of PEs per node).

Please let me know if you need any additional information, I will be happy to run any test to track this scalability problem.

Double --prefix failure with wrapped prun

shell$ echo $PRRTE_ROOT
/install/prrte-master-x-master-dbg
shell$ cd $PRRTE_ROOT/bin
shell$ ln -s prun mpirun
shell$ cd $HOME
shell$ mpirun --map-by ppr:2:node --prefix $PRRTE_ROOT  ./hello
--------------------------------------------------------------------------
Both a prefix was supplied to  and the absolute path to  was
given:

  Prefix: /install/prrte-master-x-master-dbg
  Path:   /install/prrte-master-x-master-dbg/bin

Only one should be specified to avoid potential version
confusion. Operation will continue, but the -prefix option will be
used. This is done to allow you to select a different prefix for
the backend computation nodes than used on the frontend for .
--------------------------------------------------------------------------
sh: /install/prrte-master-x-master-dbg/prte: No such file or directory
mpirun failed to initialize, likely due to no DVM being available

A couple of items here.

  • The help message ([prun:double-prefix]) is missing some string values.
  • I don't think this should trigger an error, since the first prefix is implied by invoking mpirun through the soft link.

This was found by Open MPI MTT testing, which tends to rely on the prefix to select the specific build for that run.

Compile fails

Thank you for taking the time to submit an issue!

Background information

compilation fails

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

git master @ b7fdd9b

Is the reference server using its internal version of PMIx, or an external one?

external

If external, what version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

github master @ 6c18b47a34621972bb4ab9cfd19a27f1f3587e97

Describe how PMIx was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

Please describe the system on which you are running

  • Operating system/version: CentOS 7
  • Computer hardware: x86_64
  • Network type: ibv

Details of the problem

  CC       orted/pmix/pmix_server.lo
  CC       orted/pmix/pmix_server_fence.lo
  CC       orted/pmix/pmix_server_register_fns.lo
  CC       orted/pmix/pmix_server_dyn.lo
  CC       orted/pmix/pmix_server_pub.lo
  CC       orted/pmix/pmix_server_gen.lo
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c: In function 'pmix_server_notify':
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:418:18: warning: assignment from incompatible pointer type [enabled by default]
         cd->info = OBJ_NEW(opal_list_t);
                  ^
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:420:5: error: 'val' undeclared (first use in this function)
     val = OBJ_NEW(opal_value_t);
     ^
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:420:5: note: each undeclared identifier is reported only once for each function it appears in
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:424:5: warning: passing argument 1 of '_opal_list_append' from incompatible pointer type [enabled by default]
     opal_list_append(cd->info, &val->super);
     ^
In file included from ../../psrvr-git/opal/dss/dss_types.h:42:0,
                 from ../../psrvr-git/opal/dss/dss.h:32,
                 from ../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:37:
../../psrvr-git/opal/class/opal_list.h:544:20: note: expected 'struct opal_list_t *' but argument is of type 'struct pmix_info_t *'
 static inline void _opal_list_append(opal_list_t *list, opal_list_item_t *item
                    ^
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c: In function 'pmix_server_notify_event':
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:461:23: error: 'val' undeclared (first use in this function)
     OPAL_LIST_FOREACH(val, info, opal_value_t) {
                       ^
../../psrvr-git/opal/class/opal_list.h:215:8: note: in definition of macro 'OPAL_LIST_FOREACH'
   for (item = (type *) (list)->opal_list_sentinel.opal_list_next ;      \
        ^
../../psrvr-git/opal/class/opal_list.h:215:30: error: 'pmix_info_t' has no member named 'opal_list_sentinel'
   for (item = (type *) (list)->opal_list_sentinel.opal_list_next ;      \
                              ^
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:461:5: note: in expansion of macro 'OPAL_LIST_FOREACH'
     OPAL_LIST_FOREACH(val, info, opal_value_t) {
     ^
../../psrvr-git/opal/class/opal_list.h:216:32: error: 'pmix_info_t' has no member named 'opal_list_sentinel'
        item != (type *) &(list)->opal_list_sentinel ;                   \
                                ^
../../psrvr-git/orte/orted/pmix/pmix_server_gen.c:461:5: note: in expansion of macro 'OPAL_LIST_FOREACH'
     OPAL_LIST_FOREACH(val, info, opal_value_t) {
     ^
make[2]: *** [orted/pmix/pmix_server_gen.lo] Error 1
make[2]: Leaving directory `/gpfs/home/arcurtis/src/prrte/build/orte'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/gpfs/home/arcurtis/src/prrte/build/orte'
make: *** [all-recursive] Error 1

pcc is not using the right pmix.h

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

prrte version: master @ 1aec6c4

Is the reference server using its internal version of PMIx, or an external one?

external

If external, what version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

pmix : master @ be15631db82cf9b3fd5078f1336812de0b500838

Describe how PMIx was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

Details of the problem

When I use pcc to compile the ompi/examples, it is not using the external pmix.h.
With $pcc --show-me, the -I/external_pmix_install_path/include flag is missing.

prrte problem when running applications on multiple nodes

Thank you for taking the time to submit an issue!

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

git master @ ffe3dd3

Is the reference server using its internal version of PMIx, or an external one?

external

If external, what version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

git master @ 3f81378fc76c12c6564c2fce2c69608a286a1707

Describe how PMIx was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone (with external libevent, pmix, enable-debug)

Please describe the system on which you are running

  • Operating system/version:Scientific Linux 7.4 (Nitrogen)
  • Computer hardware:x86-64
  • Network type: infiniband

Details of the problem

Allocate 2 nodes:
salloc -k -N 2
Start the DVM using:
$prte -pmca pmix ext4x -pmca pmix_server_base_verbose 10 -debug-daemons
Run the example log.c under prrte/examples:
$prun -np 4 log --global-syslog

DVM ready
[phi.icl.utk.edu:19807] SWITCHYARD for 578093057:0:27
[phi.icl.utk.edu:19807] recvd pmix cmd JOB CONTROL from 578093057:0
[phi.icl.utk.edu:19807] recvd job control request from client
[phi.icl.utk.edu:19807] SWITCHYARD for 578093057:0:27
[phi.icl.utk.edu:19807] recvd pmix cmd REGISTER EVENT HANDLER from 578093057:0
[phi.icl.utk.edu:19807] server:regevents_cbfunc called status = 0
[phi.icl.utk.edu:19807] SWITCHYARD for 578093057:0:27
[phi.icl.utk.edu:19807] recvd pmix cmd SPAWN from 578093057:0
[phi.icl.utk.edu:19807] [[8821,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[phi.icl.utk.edu:19807] [[8821,0],0] orted_cmd: received add_local_procs
[phi.icl.utk.edu:19807] pmix:server _register_nspace 578093058
[helium.phi:20416] [[8821,0],1] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[helium.phi:20416] [[8821,0],1] orted_cmd: received add_local_procs
[lithium.phi:18531] [[8821,0],2] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[lithium.phi:18531] [[8821,0],2] orted_cmd: received add_local_procs
[helium.phi:20416] pmix:server register client 578093058:0
[helium.phi:20416] pmix:server register client 578093058:1
[helium.phi:20416] pmix:server register client 578093058:2
[helium.phi:20416] pmix:server register client 578093058:3
[helium.phi:20416] pmix:server _register_client for nspace 578093058 rank 0
[helium.phi:20416] pmix:server _register_client for nspace 578093058 rank 1
[helium.phi:20416] pmix:server _register_client for nspace 578093058 rank 2
[helium.phi:20416] pmix:server _register_client for nspace 578093058 rank 3
[lithium.phi:18531] pmix:server _register_nspace 578093058
[helium.phi:20416] pmix:server _register_nspace 578093058
[helium.phi:20416] pmix:server setup_fork for nspace 578093058 rank 0
[helium.phi:20416] pmix:server setup_fork for nspace 578093058 rank 1
[helium.phi:20416] pmix:server setup_fork for nspace 578093058 rank 2
[helium.phi:20416] pmix:server setup_fork for nspace 578093058 rank 3
[phi.icl.utk.edu:19807] SWITCHYARD for 578093057:0:27
[phi.icl.utk.edu:19807] recvd pmix cmd REGISTER EVENT HANDLER from 578093057:0
[phi.icl.utk.edu:19807] server:regevents_cbfunc called status = 0
[helium.phi:20416] SWITCHYARD for 578093058:0:22
[helium.phi:20416] recvd pmix cmd REQUEST INIT INFO from 578093058:0
[helium.phi:20416] SWITCHYARD for 578093058:1:23
[helium.phi:20416] recvd pmix cmd REQUEST INIT INFO from 578093058:1
[helium.phi:20416] SWITCHYARD for 578093058:0:22
[helium.phi:20416] recvd pmix cmd LOG from 578093058:0
[helium.phi:20416] recvd log from client
prted: orted/pmix/pmix_server_gen.c:1204: pmix_server_log_fn: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&bo))->obj_magic_id' failed.
srun: error: helium: task 0: Aborted (core dumped)
srun: Terminating job step 7723.0
[lithium.phi:18531] [[8821,0],2]:base/ess_base_std_orted.c(676) updating exit status to 1
(null): Forwarding signal 18 to job
[lithium.phi:18531] pmix:server finalize called
[lithium.phi:18531] pmix:server finalize complete
srun: error: lithium: task 1: Exited with exit code 1
[phi.icl.utk.edu:19807] pmix:server finalize called

This works fine on 1 node; it only happens when you have multiple nodes. For my test, I use 2 nodes.

Occasional error messages from client at program start

Thank you for taking the time to submit an issue!

Background information

What version of the PMIx Reference Library are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

PMIx server: github master @ 315681d

PMIx client: github master @ ef2575f3ac21a3261da16d827fe2efd27b46151c

Describe how PMIx was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git-clone

Please describe the system on which you are running

  • Operating system/version: CentOS 7
  • Computer hardware: x86_64
  • Network type: IB

Details of the problem

At program start, I occasionally see these messages from the client code

[cn090:05314] [[43133,0],1] ORTE_ERROR_LOG: Not found in file ../../prrte-git/orte/util/nidmap.c at line 761
[cn090:05314] [[43133,0],1] ORTE_ERROR_LOG: Not found in file ../../prrte-git/orte/orted/orted_comm.c at line 270

The program still continues to run fine, though.

Compilation error in ptrace() in odls_default_module.c on FreeBSD

Thank you for taking the time to submit an issue!

Background information

Compilation error on FreeBSD 12 in a ptrace() call

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

github master @ d64505a

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

github master @ a3cfa97da6983a33411e367f6a250964cff1dc55

Please describe the system on which you are running

  • Operating system/version: FreeBSD 12.1
  • Computer hardware: x86_64
  • Network type: N/A

Details of the problem

Making all in mca/odls/default
  CC       odls_default_module.lo
odls_default_module.c: In function 'do_parent':
odls_default_module.c:473:20: error: 'PTRACE_DETACH' undeclared (first use in this function); did you mean 'PRRTE_DETACH'?
  473 |             ptrace(PTRACE_DETACH, cd->child->pid, 0, (void*)SIGSTOP);
      |                    ^~~~~~~~~~~~~
      |                    PRRTE_DETACH
odls_default_module.c:473:20: note: each undeclared identifier is reported only once for each function it appears in
odls_default_module.c:473:54: warning: passing argument 4 of 'ptrace' makes integer from pointer without a cast [-Wint-conversion]
  473 |             ptrace(PTRACE_DETACH, cd->child->pid, 0, (void*)SIGSTOP);
      |                                                      ^
      |                                                      |
      |                                                      void *
In file included from odls_default_module.c:113:
/usr/include/sys/ptrace.h:220:57: note: expected 'int' but argument is of type 'void *'
  220 | int ptrace(int _request, pid_t _pid, caddr_t _addr, int _data);
      |                                                     ~~~~^~~~~
*** Error code 1
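For reference, a hedged sketch of one way the detach could be made to compile on both platforms; this is not the committed PRRTE fix. The FreeBSD argument types follow the ptrace() prototype quoted in the error above, and PT_DETACH plus the (caddr_t)1 "resume where it stopped" convention are assumptions taken from the FreeBSD ptrace(2) interface.

#include <sys/types.h>
#include <sys/ptrace.h>
#include <signal.h>
#include <stdint.h>

int detach_from_child(pid_t pid)
{
#if defined(__FreeBSD__)
    /* int ptrace(int _request, pid_t _pid, caddr_t _addr, int _data); */
    return ptrace(PT_DETACH, pid, (caddr_t) 1, SIGSTOP);
#else
    /* Linux: addr is unused, data carries the signal delivered on detach */
    return (int) ptrace(PTRACE_DETACH, pid, NULL, (void *) (intptr_t) SIGSTOP);
#endif
}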

Build error: orte/mca/schizo/singularity/configure.m4 does not exist

Thank you for taking the time to submit an issue!

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

master branch (commit 891a7dd).

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

PMIx 3.1.2

Please describe the system on which you are running

  • Operating system/version: Debian
  • Computer hardware:
  • Network type:

Details of the problem

When doing configure then make, the make command fails immediately with the following error:

CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/bash /home/mdorier/prrte/config/missing aclocal-1.15 -I config
aclocal-1.15: error: config/autogen_found_items.m4:180: file 'orte/mca/schizo/singularity/configure.m4' does not exist
Makefile:878: recipe for target 'aclocal.m4' failed
make: *** [aclocal.m4] Error 1

Job failed to spawn: UNREACHABLE

Thank you for taking the time to submit an issue!

Background information

After a recent update via git pull, I started getting the above error message when launching programs.

N.B. it continues to work if PMIx is configured with --enable-debug.

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

git master @ e886f1d

Is the reference server using its internal version of PMIx, or an external one?

external

If external, what version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

git master @ f894bfce36d11913e81f05b54da0f1fead8c3701

Describe how PMIx was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

clone

Please describe the system on which you are running

  • Operating system/version: CentOS 7
  • Computer hardware: x86_64
  • Network type: IBV

Details of the problem

When running through prun I get:

arcurtis@cn-mem[1](~/shmem/novo-test) prte -pmca pmix_server_verbose 99 -pmca orte_data_server_verbose 99 -pmca orte_report_silent_errors 1 -pmca odls_base_verbose 99 &
[1] 34302
arcurtis@cn-mem[1](~/shmem/novo-test) [cn-mem:34302] mca: base: components_register: registering framework odls components
[cn-mem:34302] mca: base: components_register: found loaded component default
[cn-mem:34302] mca: base: components_register: component default has no register or open function
[cn-mem:34302] mca: base: components_open: opening odls components
[cn-mem:34302] mca: base: components_open: found loaded component default
[cn-mem:34302] mca: base: components_open: component default open function successful
[cn-mem:34302] mca:base:select: Auto-selecting odls components
[cn-mem:34302] mca:base:select:( odls) Querying component [default]
[cn-mem:34302] mca:base:select:( odls) Query of component [default] set priority to 10
[cn-mem:34302] mca:base:select:( odls) Selected component [default]
DVM ready

arcurtis@cn-mem[1](~/shmem/novo-test) prun -v -n 1 ./a.out
[cn-mem:34302] [[38001,0],0] TOOL CONNECTION REQUEST RECVD
[cn-mem:34302] [[38001,0],0] TOOL CONNECTION PROCESSING
[cn-mem:34302] [[38001,0],0] TOOL CONNECTION FROM UID 170008941 GID 170008941
[cn-mem:34302] [[38001,0],0] spawn called from proc [[38001,1],0]
[cn-mem:34302] *** Process received signal ***
[cn-mem:34302] Signal: Segmentation fault (11)
[cn-mem:34302] Signal code: Address not mapped (1)
[cn-mem:34302] Failing at address: 0x30
[cn-mem:34302] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2aaaace03370]
[cn-mem:34302] [ 1] /gpfs/home/arcurtis/opt/pmix/git/lib/pmix/mca_pnet_tcp.so(+0x24fc)[0x2aaab13334fc]
[cn-mem:34302] [ 2] /gpfs/home/arcurtis/opt/pmix/git/lib/pmix/mca_pnet_tcp.so(+0x6c3e)[0x2aaab1337c3e]
[cn-mem:34302] [ 3] /gpfs/home/arcurtis/opt/pmix/git/lib/libpmix.so.0(pmix_pnet_base_allocate+0x190)[0x2aaaab4b4550]
[cn-mem:34302] [ 4] /gpfs/home/arcurtis/opt/pmix/git/lib/libpmix.so.0(+0x56c26)[0x2aaaab462c26]
[cn-mem:34302] [ 5] /gpfs/projects/ChapmanGroup/opt/libevent/lib/libevent-2.1.so.6(+0x2153d)[0x2aaaab90b53d]
[cn-mem:34302] [ 6] /gpfs/projects/ChapmanGroup/opt/libevent/lib/libevent-2.1.so.6(event_base_loop+0x3ef)[0x2aaaab90bc4f]
[cn-mem:34302] [ 7] /gpfs/home/arcurtis/opt/pmix/git/lib/libpmix.so.0(+0x7348e)[0x2aaaab47f48e]
[cn-mem:34302] [ 8] /lib64/libpthread.so.0(+0x7dc5)[0x2aaaacdfbdc5]
[cn-mem:34302] [ 9] /lib64/libc.so.6(clone+0x6d)[0x2aaaad10776d]
[cn-mem:34302] *** End of error message ***
[cn-mem:34324] Job failed to spawn: UNREACHABLE
[1]+  Segmentation fault      (core dumped) prte -pmca pmix_server_verbose 99 -pmca orte_data_server_verbose 99 -pmca orte_report_silent_errors 1 -pmca odls_base_verbose 99

GDB of prte:

(gdb) r
Starting program: /gpfs/home/arcurtis/opt/prrte/git/bin/prte
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x2aaaae010700 (LWP 4311)]
[New Thread 0x2aaab1562700 (LWP 4312)]
[New Thread 0x2aaab2185700 (LWP 4313)]
[New Thread 0x2aaab2386700 (LWP 4314)]
DVM ready

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2aaaae010700 (LWP 4311)]
0x00002aaab115a4cc in pmix_obj_run_destructors (
    object=0x2aaab13613f0 <available+16>)
    at /gpfs/home/arcurtis/src/pmix/pmix-git/src/class/pmix_object.h:452
452	    cls_destruct = object->obj_class->cls_destruct_array;
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.163-3.el7.x86_64 elfutils-libs-0.163-3.el7.x86_64 glibc-2.17-157.el7_3.5.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-10.el7.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 libstdc++-4.8.5-4.el7.x86_64 libxml2-2.9.1-6.el7_2.2.x86_64 openssl-libs-1.0.1e-51.el7_2.4.x86_64 pcre-8.32-15.el7.x86_64 systemd-libs-219-19.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) where
#0  0x00002aaab115a4cc in pmix_obj_run_destructors (
    object=0x2aaab13613f0 <available+16>)
    at /gpfs/home/arcurtis/src/pmix/pmix-git/src/class/pmix_object.h:452
#1  ttdes (p=0x2aaab4034bc0)
    at ../../../../../pmix-git/src/mca/pnet/tcp/pnet_tcp.c:181
#2  0x00002aaab115ec0e in pmix_obj_run_destructors (object=0x2aaab4034bc0)
    at /gpfs/home/arcurtis/src/pmix/pmix-git/src/class/pmix_object.h:454
#3  allocate (nptr=0x2aaab4033f80, info=<optimized out>, ilist=0x2aaaae00fd70)
    at ../../../../../pmix-git/src/mca/pnet/tcp/pnet_tcp.c:620
#4  0x00002aaaab4b7290 in pmix_pnet_base_allocate (nspace=<optimized out>,
    info=<optimized out>, ninfo=<optimized out>, ilist=<optimized out>)
    at ../../../../pmix-git/src/mca/pnet/base/pnet_base_fns.c:121
#5  0x00002aaaab464d96 in _setup_app (sd=<optimized out>,
    args=<optimized out>, cbdata=0x822a00)
    at ../../pmix-git/src/server/pmix_server.c:1461
#6  0x00002aaaab90e53d in event_process_active_single_queue (
    base=base@entry=0x709950, activeq=0x709da0,
    max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0)
    at event.c:1646
#7  0x00002aaaab90ec4f in event_process_active (base=0x709950) at event.c:1738
#8  event_base_loop (base=0x709950, flags=flags@entry=1) at event.c:1961
#9  0x00002aaaab4815fe in progress_engine (obj=<optimized out>)
    at ../../pmix-git/src/runtime/pmix_progress_threads.c:109
#10 0x00002aaaacbfddc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00002aaaacf0976d in clone () from /lib64/libc.so.6

(PMIx and PRRTE both using same version of hwloc.)

prrte segfaulting with latest PMIx update

Thank you for taking the time to submit an issue!

Background information

I just installed the latest PMIx update from GitHub; prrte now segfaults.

(Wes [wessle] is working with me, BTW, this is all related)

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

git master @ ffe3dd3

Is the reference server using its internal version of PMIx, or an external one?

External

If external, what version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

git master @ a1d3610c2b0eadf68948eead2ec64fc29d799a9e

Describe how PMIx was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

Please describe the system on which you are running

  • Operating system/version: CentOS 7
  • Computer hardware: x86_64
  • Network type: ibv

Details of the problem

Run prte to get the DVM, then

$ prun -n 1 pmix-client-program

generates this from prrte:

(gdb) r
Starting program: /opt/prrte/bin/prte 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff1c07700 (LWP 52013)]
[New Thread 0x7ffff0dee700 (LWP 52014)]
[New Thread 0x7fffefbcb700 (LWP 52015)]
[New Thread 0x7fffef3ca700 (LWP 52016)]
DVM ready

Thread 2 "prte" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff1c07700 (LWP 52013)]
query_cbfunc (status=<optimized out>, status@entry=0, info=info@entry=0x0, 
    ninfo=<optimized out>, ninfo@entry=0, cbdata=0x7fffe800db70, 
    release_fn=release_fn@entry=0x0, release_cbdata=release_cbdata@entry=0x0)
    at ../../pmix-git/src/server/pmix_server.c:2748
2748	../../pmix-git/src/server/pmix_server.c: No such file or directory.
(gdb) bt
#0  query_cbfunc (status=<optimized out>, status@entry=0, info=info@entry=0x0, 
    ninfo=<optimized out>, ninfo@entry=0, cbdata=0x7fffe800db70, 
    release_fn=release_fn@entry=0x0, release_cbdata=release_cbdata@entry=0x0)
    at ../../pmix-git/src/server/pmix_server.c:2748
#1  0x00007ffff741d3ce in pmix_server_job_ctrl (
    peer=peer@entry=0x7fffe800e9c0, buf=buf@entry=0x7ffff1c06c80, 
    cbfunc=cbfunc@entry=0x7ffff73fd1a0 <query_cbfunc>, cbdata=<optimized out>)
    at ../../pmix-git/src/server/pmix_server_ops.c:2541
#2  0x00007ffff7402f4a in server_switchyard (peer=peer@entry=0x7fffe800e9c0, 
    tag=101, buf=buf@entry=0x7ffff1c06c80)
    at ../../pmix-git/src/server/pmix_server.c:3196
#3  0x00007ffff7403897 in pmix_server_message_handler (pr=0x7fffe800e9c0, 
    hdr=0x7fffe800da08, buf=0x7ffff1c06c80, cbdata=<optimized out>)
    at ../../pmix-git/src/server/pmix_server.c:3246
#4  0x00007ffff746b1be in pmix_ptl_base_process_msg (fd=<optimized out>, 
    flags=<optimized out>, cbdata=0x7fffe800d930)
    at ../../../../pmix-git/src/mca/ptl/base/ptl_base_sendrecv.c:719
#5  0x00007ffff6f74345 in event_process_active_single_queue (
    base=base@entry=0x6ca840, activeq=0x6cac90, 
    max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0)
    at event.c:1646
#6  0x00007ffff6f74d47 in event_process_active (base=0x6ca840) at event.c:1738
#7  event_base_loop (base=0x6ca840, flags=flags@entry=1) at event.c:1961
#8  0x00007ffff7427fde in progress_engine (obj=<optimized out>)
    at ../../pmix-git/src/runtime/pmix_progress_threads.c:109
#9  0x00007ffff634c594 in start_thread (arg=<optimized out>)
    at pthread_create.c:463
#10 0x00007ffff60800df in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) 

Build error with latest git HEAD on PMIX_SERVER_SCHEDULER

Background information

  • PRRTE master @ bfb0246
  • PMIx v3.1.3
  • Build Host:
    • Operating system/version: Linux ubuntu 16.04
    • Computer hardware: x86-64
    • Network type: ethernet

Details of the problem

Build failure with the latest PRRTE master against pmix-3.1.3: PRRTE uses PMIX_SERVER_SCHEDULER without a guard, but that attribute does not exist in this PMIx version. I am not sure whether this should be a configure-time check, or how PMIx-version-specific bits are supposed to be handled in PRRTE; a minimal guard sketch follows the build log below.

../configure \
    --prefix=$PKG_INSTALL_PREFIX \
    --with-hwloc=$HWLOC_INSTALL_DIR \
    --with-pmix=$PMIX_INSTALL_DIR \
    --with-libevent=$LIBEVENT_INSTALL_DIR \
    --enable-orterun-prefix-by-default \
&& make -j 4 \
&& make install

...<snip>...

  CC       orted/pmix/pmix_server_pub.lo
In file included from /usr/include/string.h:630:0,
                 from /home/3t4/projects/pmix/ssd-pmix/prrte/install/include/hwloc.h:59,
                 from ../../opal/hwloc/hwloc-internal.h:28,
                 from ../../opal/util/proc.h:22,
                 from ../../orte/include/orte/types.h:30,
                 from ../../orte/orted/pmix/pmix_server.c:30:
../../orte/orted/pmix/pmix_server.c: In function ‘pmix_server_init’:
../../orte/orted/pmix/pmix_server.c:382:26: error: ‘PMIX_SERVER_SCHEDULER’ undeclared (first use in this function)
         kv->key = strdup(PMIX_SERVER_SCHEDULER);
                          ^
../../orte/orted/pmix/pmix_server.c:382:26: note: each undeclared identifier is reported only once for each function it appears in
Makefile:1473: recipe for target 'orted/pmix/pmix_server.lo' failed
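A minimal sketch of the kind of compile-time guard that might work, assuming only that PMIX_SERVER_SCHEDULER is a plain preprocessor macro in the installed pmix.h (an equivalent configure-time check would also do). This is an illustration, not PRRTE's actual code:

#include <stdio.h>
#include <pmix.h>

/* Illustration only: probe whether the attribute exists in the PMIx headers
 * being compiled against.  The same #ifdef guard around the
 * strdup(PMIX_SERVER_SCHEDULER) call in pmix_server.c would let PRRTE build
 * against PMIx releases that predate the attribute. */
int main(void)
{
#ifdef PMIX_SERVER_SCHEDULER
    printf("PMIX_SERVER_SCHEDULER is available: %s\n", PMIX_SERVER_SCHEDULER);
#else
    printf("PMIX_SERVER_SCHEDULER is not defined by this PMIx install\n");
#endif
    return 0;
}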

Exchanging data with PMIx_Publish/Lookup does not scale well

Background information

What version of the PMIx Reference Server are you using?

github master @ 716be58 (so that it compiles with PMIx 3.1.2)

What version of PMIx are you using?

PMIx 3.1.2 (w/ external hwloc 2.0.3)

Please describe the system on which you are running

  • Operating system/version: CentOS Linux release 7.5.1804 x86_64
  • Computer hardware: Xeon E5-2690 v3
  • Network type:
Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)

Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]

Details of the problem

Hello, we are the maintainers of the OpenSHMEM implementation OSSS-UCX, which uses PMIx to exchange UCX parameters during its start-up.

Details: https://github.com/openshmem-org/osss-ucx/blob/master/src/shmemc/ucx/pmix_client.c

Initially we used PMIx_Publish and PMIx_Lookup to do this, but that approach scales poorly on several HPC clusters we have tested. For a simple hello-world program that does nothing other than call shmem_init() and shmem_finalize(), it takes OSSS-UCX about 120 seconds on 192 PEs. Below is a trimmed output from the Linux kernel's perf profiler.

|--90.89%--orte_rml_base_process_msg
|          |          
|           --90.44%--orte_data_server
|                     |          
|                     |--82.08%--orte_util_print_name_args
|                     |          |          
|                     |          |--32.02%--__snprintf
|                     |          |          |          
|                     |          |           --31.64%--_IO_vsnprintf
|                     |          |                     |          
|                     |          |                     |--27.75%--vfprintf
|                     |          |                     |          
|                     |          |                     |--1.62%--_IO_str_init_static_internal
|                     |          |                     |          
|                     |          |                      --1.52%--_IO_no_init
|                     |          |          
|                     |          |--30.60%--orte_util_print_jobids
|                     |          |          |          
|                     |          |          |--29.22%--__snprintf
|                     |          |          |          |          
|                     |          |          |           --28.81%--_IO_vsnprintf
|                     |          |          |                     |          
|                     |          |          |                     |--25.23%--vfprintf
|                     |          |          |                     |          
|                     |          |          |                     |--1.72%--_IO_str_init_static_internal
|                     |          |          |                     |          
|                     |          |          |                      --1.03%--_IO_no_init
|                     |          |          |          
|                     |          |           --0.60%--get_print_name_buffer
|                     |          |          
|                     |          |--17.00%--orte_util_print_vpids
|                     |          |          |          
|                     |          |          |--15.09%--__snprintf
|                     |          |          |          |          
|                     |          |          |           --14.56%--_IO_vsnprintf
|                     |          |          |                     |          
|                     |          |          |                     |--10.41%--vfprintf
|                     |          |          |                     |          
|                     |          |          |                     |--1.77%--_IO_no_init
|                     |          |          |                     |          
|                     |          |          |                      --1.52%--_IO_str_init_static_internal
|                     |          |          |          
|                     |          |           --0.98%--get_print_name_buffer
|                     |          |                     |          
|                     |          |                      --0.76%--pthread_getspecific
|                     |          |          
|                     |           --1.25%--get_print_name_buffer
|                     |                     |          
|                     |                      --0.80%--pthread_getspecific
|                     |          
|                     |--3.78%--__strncmp_sse42
|                     |          
|                     |--0.80%--pthread_mutex_unlock
|                     |          
|                      --0.67%--pthread_mutex_lock

Apparently, the function orte_data_server was called a very large number of times, and 90% of the total run time was spent in it.

Looking closer, the function orte_util_print_name_args (ORTE_NAME_PRINT) is the most expensive part: it always formats the name strings, even when no log message is actually printed.

I forked prrte and removed all the lines in orte_data_server that contain ORTE_NAME_PRINT, and this reduced the total run time of the hello-world program to around 20 seconds (orte_data_server is still called a very large number of times).
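For illustration, here is a rough sketch of the lazy-formatting idea: only build the name string once a cheap verbosity check has passed, so the snprintf cost stays off the fast path. The names verbose_enabled and print_name are hypothetical stand-ins for the framework's verbosity test and ORTE_NAME_PRINT, not PRRTE's real API:

#include <stdarg.h>
#include <stdio.h>

static int verbose_level = 0;            /* assume this is set from a parameter */

static int verbose_enabled(int level) { return level <= verbose_level; }

static const char *print_name(int jobid, int vpid)
{
    static char buf[64];                 /* the real code uses per-thread buffers */
    snprintf(buf, sizeof(buf), "[%d,%d]", jobid, vpid);
    return buf;
}

static void log_verbose(int level, const char *fmt, ...)
{
    if (!verbose_enabled(level)) return;
    va_list ap;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
}

int main(void)
{
    int jobid = 12106, vpid = 3;

    /* Expensive: the name is formatted as an argument even though
     * log_verbose then prints nothing at this verbosity level. */
    log_verbose(10, "processing request from %s\n", print_name(jobid, vpid));

    /* Cheap: formatting only happens when the stream is actually active. */
    if (verbose_enabled(10)) {
        log_verbose(10, "processing request from %s\n", print_name(jobid, vpid));
    }
    return 0;
}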

In the development branch of OSSS-UCX we have switched to PMIx_Get/Put/Commit and now it only takes about 10 seconds to run the hello world program on 192 PEs without needing to remove the string formatting macro.

New version: https://bitbucket.org/wenblu/osss-ucx/src/master/src/shmemc/ucx/pmix_client.c
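For reference, a stripped-down sketch of the Put/Commit/Fence/Get pattern the new client relies on. The key name and payload below are made up for illustration; the real client stores packed UCX worker addresses:

#include <stdio.h>
#include <string.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t me, peer;
    pmix_value_t val, *ret;
    pmix_status_t rc;

    if (PMIX_SUCCESS != (rc = PMIx_Init(&me, NULL, 0))) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* post this PE's data to the key-value store
     * (a literal is fine here since PMIx_Put copies the value) */
    val.type = PMIX_STRING;
    val.data.string = "hello from this PE";
    rc = PMIx_Put(PMIX_GLOBAL, "osss.example.blob", &val);
    rc = PMIx_Commit();

    /* collective exchange so remote data becomes visible */
    rc = PMIx_Fence(NULL, 0, NULL, 0);

    /* read the value posted by rank 0 of our namespace */
    PMIX_PROC_CONSTRUCT(&peer);
    (void)strncpy(peer.nspace, me.nspace, PMIX_MAX_NSLEN);
    peer.rank = 0;
    if (PMIX_SUCCESS == (rc = PMIx_Get(&peer, "osss.example.blob", NULL, 0, &ret))) {
        printf("rank 0 posted: %s\n", ret->data.string);
        PMIX_VALUE_RELEASE(ret);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}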

@tonycurtis

OMPI integration issues to be resolved

There are a number of things that need to be done to complete the OMPI integration effort. I'm going to list them here for tracking purposes and in the hope that others might pick some of them up. If you do, please edit this comment and put your name at the beginning of the item you are working on so we avoid duplicate effort. Obviously, there will be some "ompi" items in this list. This is a "living" list, so expect more things to be added as they are identified.

  • [@rhc54] Revise command line setup/parsing. Need to expand it a bit to allow for multiple command line definitions. Need to handle different MCA params for OMPI vs PRRTE.

  • Singleton support. IIRC, I enabled PMIx_Init to support singletons - i.e., when the client is not launched by a daemon and thus has no contact information for a PMIx server. However, I didn't do anything about the case of singleton comm_spawn where the client needs to start a PMIx server and then connect back to it.

  • Resolve reported comm_spawn issues. Multiple reports of comm_spawn problems on the OMPI mailing lists and issues. Includes missing support for various MPI_Info arguments such as "add_hostfile" that may (likely) require some updates to PRRTE

  • Decide what to do about legacy ORTE MCA params. These probably need to be detected and converted to their PRRTE equivalent

  • Update PRRTE frameworks to use MCA params solely for setting default behavior, overridden on a per-job basis by user specifications.

  • [@jsquyres] Come up with a way for "ompi_info" to include PRRTE information

  • Resolve multi-mpirun connect/accept issues - do we auto-detect the presence of another DVM and launch within it, or do we launch a 2nd DVM and "connect" between them, or...?

  • Devise support for user obtaining an MPI "port", printing it out, and then feeding it to another mpirun on the cmd line for connect/accept

prte: prefix environment not propagated properly

Background information

What version of the PMIx Reference Server are you using?

git prrte fc30acb (via latest master of open-mpi/ompi@960c5f7)

What version of PMIx are you using?

pmix4x (via latest ompi master)

Please describe the system on which you are running

  • Operating system/version: Linux
  • Computer hardware: x86_64
  • Network type: TCP

Details of the problem

The environment is not propagated to remote hosts during the SSH launch. This appears to be an issue with --enable-orterun-prefix-by-default at compile time or when using --prefix at runtime.

Configured with VPATH (via ompi build)

./autogen.pl
cd BUILD-master/
../configure \
    --enable-orterun-prefix-by-default \
    --prefix=${OMPI_INSTALL_DIR} \
    --enable-debug \
&& make -j 4 \
&& make install

Example to reproduce problem

[3t4@node0 BUILD-master]$ hostname
node0
[3t4@node0 BUILD-master]$ more hosts 
node1
node2
[3t4@node0 BUILD-master]$ prte --hostfile hosts &
[1] 9107
[3t4@node0 BUILD-master]$ bash: prted: command not found
bash: prted: command not found
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/rml/oob/rml_oob_send.c at line 202
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../prrte/src/mca/plm/base/plm_base_launch_support.c at line 632
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/errmgr/dvm/errmgr_dvm.c at line 418
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/rml/oob/rml_oob_send.c at line 202
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/errmgr/dvm/errmgr_dvm.c at line 563
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/rml/oob/rml_oob_send.c at line 202
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../prrte/src/mca/plm/base/plm_base_launch_support.c at line 632
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/errmgr/dvm/errmgr_dvm.c at line 418
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/rml/oob/rml_oob_send.c at line 202
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../prrte/src/mca/plm/base/plm_base_launch_support.c at line 632
[node0:09107] PRRTE ERROR: Bad parameter in file ../../../../../../prrte/src/mca/errmgr/dvm/errmgr_dvm.c at line 418

[1]+  Exit 127                prte --hostfile hosts
[3t4@node0 BUILD-master]$

PMIX clients time out on start-up using PRRTE as launcher

Thank you for taking the time to submit an issue!

Background information

Launch of PMIX clients fails/times-out with latest PRRTE

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

master @ d54aa74

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

3.1.4 release and master @ 80f80b17589232eee49f6807afda2b853aee51d2

Please describe the system on which you are running

  • Operating system/version: CentOS 7.5
  • Computer hardware: x86_64
  • Network type: IB

Details of the problem

Programs launched on more than 4 nodes (4 may simply be the point at which I start seeing the issue; the number itself has no specific meaning) under PBS/Torque time out with

ORTE has lost communication with a remote daemon.

  HNP daemon   : [[12106,0],0] on node cn099
  Remote daemon: [[12106,0],1] on node cn002

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

The same programs launch immediately when using Open MPI 4.0.x as the PMIx server.
