Comments (17)
It's working fine on the systems I can access, so it must be something either in your system or perhaps a stale plugin. Try adding "-pmca state_base_verbose 5" to your cmd line. Also, do you have a hostfile somewhere, or are you running in an allocation? Or is this just a one-node test?
from prrte.
This is just a vanilla "prte" run on the login node. Debug output with added flag:
prte -d --pmca ess_base_verbose 1000 -pmca state_base_verbose 5
[login:23923] mca: base: components_register: registering framework ess components
[login:23923] mca: base: components_register: found loaded component tm
[login:23923] mca: base: components_register: component tm has no register or open function
[login:23923] mca: base: components_register: found loaded component env
[login:23923] mca: base: components_register: component env has no register or open function
[login:23923] mca: base: components_register: found loaded component hnp
[login:23923] mca: base: components_register: component hnp has no register or open function
[login:23923] mca: base: components_register: found loaded component slurm
[login:23923] mca: base: components_register: component slurm has no register or open function
[login:23923] mca: base: components_open: opening ess components
[login:23923] mca: base: components_open: found loaded component tm
[login:23923] mca: base: components_open: component tm open function successful
[login:23923] mca: base: components_open: found loaded component env
[login:23923] mca: base: components_open: component env open function successful
[login:23923] mca: base: components_open: found loaded component hnp
[login:23923] mca: base: components_open: component hnp open function successful
[login:23923] mca: base: components_open: found loaded component slurm
[login:23923] mca: base: components_open: component slurm open function successful
[login:23923] mca:base:select: Auto-selecting ess components
[login:23923] mca:base:select:( ess) Querying component [tm]
[login:23923] mca:base:select:( ess) Querying component [env]
[login:23923] mca:base:select:( ess) Querying component [hnp]
[login:23923] mca:base:select:( ess) Query of component [hnp] set priority to 100
[login:23923] mca:base:select:( ess) Querying component [slurm]
[login:23923] mca:base:select:( ess) Selected component [hnp]
[login:23923] mca: base: close: component tm closed
[login:23923] mca: base: close: unloading component tm
[login:23923] mca: base: close: component env closed
[login:23923] mca: base: close: unloading component env
[login:23923] mca: base: close: component slurm closed
[login:23923] mca: base: close: unloading component slurm
[login:23923] procdir: /tmp/ompi.login.170008941/dvm/0/0
[login:23923] jobdir: /tmp/ompi.login.170008941/dvm/0
[login:23923] top: /tmp/ompi.login.170008941/dvm
[login:23923] top: /tmp/ompi.login.170008941
[login:23923] tmp: /tmp
[login:23923] sess_dir_cleanup: job session dir does not exist
[login:23923] sess_dir_cleanup: top session dir does not exist
[login:23923] procdir: /tmp/ompi.login.170008941/dvm/0/0
[login:23923] jobdir: /tmp/ompi.login.170008941/dvm/0
[login:23923] top: /tmp/ompi.login.170008941/dvm
[login:23923] top: /tmp/ompi.login.170008941
[login:23923] tmp: /tmp
[login:23923] sess_dir_finalize: proc session dir does not exist
[login:23923] sess_dir_finalize: job session dir does not exist
[login:23923] sess_dir_finalize: jobfam session dir not empty - leaving
[login:23923] sess_dir_finalize: jobfam session dir not empty - leaving
[login:23923] sess_dir_finalize: top session dir not empty - leaving
[login:23923] sess_dir_cleanup: job session dir does not exist
[login:23923] sess_dir_cleanup: found top session dir empty - deleting
Here's the configure stanza:
../psrvr-git/configure \
--enable-debug \
--with-tm \
--prefix=/.../psrvr/git \
--with-libevent=/.../libevent \
--with-hwloc=/.../hwloc/1.11.9 \
--with-pmix=/.../pmix/git
Self-built libevent, hwloc, pmix.
from prrte.
Hmmm...that is really odd. I honestly cannot duplicate it. Even using the same cmd line, it works just fine. What's disturbing here is that you aren't getting the right debug output from the ess/hnp component - you should get the topology output along with a bunch of stuff.
Try adding "-pmca orte_report_silent_errors 1" to the cmd line - let's see if you are hitting an error that stays silent because the code assumed it had already been reported.
from prrte.
Yeah, it's mind-boggling. I was wondering about how to enable silent errors, so thanks for the enlightenment. I can't duplicate it anywhere else, either. I tried rm-rf'ing and reinstalling pmix, hwloc and prrte itself (to make sure I'm not picking up stale plugins).
prte -d --pmca ess_base_verbose 1000 -pmca orte_report_silent_errors 1
[cn-mem:53795] mca: base: components_register: registering framework ess components
[cn-mem:53795] mca: base: components_register: found loaded component tm
[cn-mem:53795] mca: base: components_register: component tm has no register or open function
[cn-mem:53795] mca: base: components_register: found loaded component env
[cn-mem:53795] mca: base: components_register: component env has no register or open function
[cn-mem:53795] mca: base: components_register: found loaded component hnp
[cn-mem:53795] mca: base: components_register: component hnp has no register or open function
[cn-mem:53795] mca: base: components_register: found loaded component slurm
[cn-mem:53795] mca: base: components_register: component slurm has no register or open function
[cn-mem:53795] mca: base: components_open: opening ess components
[cn-mem:53795] mca: base: components_open: found loaded component tm
[cn-mem:53795] mca: base: components_open: component tm open function successful
[cn-mem:53795] mca: base: components_open: found loaded component env
[cn-mem:53795] mca: base: components_open: component env open function successful
[cn-mem:53795] mca: base: components_open: found loaded component hnp
[cn-mem:53795] mca: base: components_open: component hnp open function successful
[cn-mem:53795] mca: base: components_open: found loaded component slurm
[cn-mem:53795] mca: base: components_open: component slurm open function successful
[cn-mem:53795] mca:base:select: Auto-selecting ess components
[cn-mem:53795] mca:base:select:( ess) Querying component [tm]
[cn-mem:53795] mca:base:select:( ess) Querying component [env]
[cn-mem:53795] mca:base:select:( ess) Querying component [hnp]
[cn-mem:53795] mca:base:select:( ess) Query of component [hnp] set priority to 100
[cn-mem:53795] mca:base:select:( ess) Querying component [slurm]
[cn-mem:53795] mca:base:select:( ess) Selected component [hnp]
[cn-mem:53795] mca: base: close: component tm closed
[cn-mem:53795] mca: base: close: unloading component tm
[cn-mem:53795] mca: base: close: component env closed
[cn-mem:53795] mca: base: close: unloading component env
[cn-mem:53795] mca: base: close: component slurm closed
[cn-mem:53795] mca: base: close: unloading component slurm
[cn-mem:53795] procdir: /tmp/ompi.cn-mem.170008941/dvm/0/0
[cn-mem:53795] jobdir: /tmp/ompi.cn-mem.170008941/dvm/0
[cn-mem:53795] top: /tmp/ompi.cn-mem.170008941/dvm
[cn-mem:53795] top: /tmp/ompi.cn-mem.170008941
[cn-mem:53795] tmp: /tmp
[cn-mem:53795] sess_dir_cleanup: job session dir does not exist
[cn-mem:53795] sess_dir_cleanup: top session dir does not exist
[cn-mem:53795] procdir: /tmp/ompi.cn-mem.170008941/dvm/0/0
[cn-mem:53795] jobdir: /tmp/ompi.cn-mem.170008941/dvm/0
[cn-mem:53795] top: /tmp/ompi.cn-mem.170008941/dvm
[cn-mem:53795] top: /tmp/ompi.cn-mem.170008941
[cn-mem:53795] tmp: /tmp
[cn-mem:53795] sess_dir_finalize: proc session dir does not exist
[cn-mem:53795] sess_dir_finalize: job session dir does not exist
[cn-mem:53795] sess_dir_finalize: jobfam session dir not empty - leaving
[cn-mem:53795] sess_dir_finalize: jobfam session dir not empty - leaving
[cn-mem:53795] sess_dir_finalize: top session dir not empty - leaving
[cn-mem:53795] sess_dir_cleanup: job session dir does not exist
[cn-mem:53795] sess_dir_cleanup: found top session dir empty - deleting
from prrte.
Afraid I am at a loss - all I can suggest is to go into orte/mca/ess/hnp/ess_hnp_module.c and add print statements (or use your favorite debugger) to see where it decides to jump to exit.
from prrte.
I'd descended to ess_hnp_module.c but wanted to ask first in case those-in-the-know could save me a wild goose chase.
thanks
from prrte.
Looks like the failure is in pmix_server_init, specifically in PMIx_server_init. Seems like it is pmix_hwloc_get_topology that is falling over, which makes sense, given the expected output.
from prrte.
I think this might be a case of differences in hwloc versions - I've been building against v2.0.1. I'll build against v1.11.7 and see if I can then replicate.
from prrte.
I've got 1.11.9 and 2.0.1 in my home directory, plus 1.11.3 cluster-wide. I have 1.11.9 installed via yum/dnf on the boxes where it works.
from prrte.
One thing I have found is that our hwloc configury can get confused if there is a version in standard locations, and you also have another version in your path. You might check to ensure that isn't the case. Meantime, I'll test here.
from prrte.
Okay, that was quick - got a segfault when running against 1.11. I'll try to chase this down. In the interim, I would recommend using 2.0.1 as that is known to work.
from prrte.
@bgoglin I'm finding that the topology contains several garbage fields when loading it via v1.11 instead of v2.0.1. It looks to me like the hwloc_topology_t structure has been initialized, but isn't actually being loaded. The online cpuset is NULL, the arity value is at some max value, etc. Here is the root object:
{type = HWLOC_OBJ_SYSTEM, os_index = 0, name = 0x0, memory = {total_memory = 0, local_memory = 0, page_types_len = 4212383744, page_types = 0x1ea8f70}, attr = 0x0, depth = 0,
logical_index = 0, os_level = 0, next_cousin = 0x0, prev_cousin = 0x0, parent = 0x0, sibling_rank = 0, next_sibling = 0x2, prev_sibling = 0x1eaf3a0, arity = 32179136,
children = 0x1eb5430, first_child = 0x1, last_child = 0x0, userdata = 0x1ec4dd0, cpuset = 0x0, complete_cpuset = 0x0, online_cpuset = 0x0, allowed_cpuset = 0x1ea9490,
nodeset = 0x1ea9500, complete_nodeset = 0x1ea9570, allowed_nodeset = 0x1ea95e0, distances = 0x1ebd6e0, distances_count = 24, infos = 0x0, infos_count = 1, symmetric_subtree = 0}
Do you have any ideas why it would be this messed up? I'm wondering if either the hwloc libraries and headers are getting mixed up, or if we somehow broke the v1.11 integration when we added the v2.0.1 support.
from prrte.
@rhc54 It looks like the topology was loaded with v2 while you are reading it with v1.11, which would explain several of the garbage fields above. I'll look later today.
from prrte.
Ah...PMIx wasn't finding hwloc: pointed PMIx and PRRTE at the same hwloc, and all is well.
thanks!
from prrte.
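The fix reported above was to point PMIx and PRRTE at the same hwloc. A minimal sketch of one way to keep the two builds in step, assuming hypothetical `HWLOC_DIR` and `PMIX_DIR` variables, a hypothetical `hwloc_flag` helper, and a hypothetical `../pmix-git` source directory (none of these names come from the thread; only `--with-hwloc`, `--with-pmix`, and `../psrvr-git/configure` do):

```shell
# Hypothetical install prefixes; replace with the real locations.
HWLOC_DIR="${HWLOC_DIR:-$HOME/opt/hwloc}"
PMIX_DIR="${PMIX_DIR:-$HOME/opt/pmix}"

# Emit the single hwloc flag that both configure runs should receive,
# so the two packages cannot drift onto different hwloc installs.
hwloc_flag() {
    printf '%s%s' '--with-hwloc=' "$1"
}

# From the PMIx build directory (source directory name is an assumption):
#   ../pmix-git/configure --prefix="$PMIX_DIR" "$(hwloc_flag "$HWLOC_DIR")"
# From the PRRTE build directory:
#   ../psrvr-git/configure --with-pmix="$PMIX_DIR" "$(hwloc_flag "$HWLOC_DIR")"
```

Deriving both flags from one variable is only a convenience; what matters is that the two configure runs receive the same hwloc prefix.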
@tonycurtis glad to hear you are up and running! Thx for the patience.
@bgoglin It sounds like you are thinking we have a 1.11 header, but are using the v2 library? I can explore that some more - I think this may go back to our long-standing configury problem of picking up things in common directories instead of a specific one for just that one package.
In other words: I have libevent installed in the same location as hwloc v2.0, but I was pointing --with-hwloc at the 1.11 location. When I check with ldd things are linked correctly. However, my library path may wind up catching the v2 library first. I'll try to do some checking today.
from prrte.
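The ldd check mentioned above can be scripted so that several binaries are compared at once. This is a rough sketch, not something taken from the thread: the helper names `hwloc_paths` and `same_hwloc` are made up for illustration, and it assumes a Linux system where `ldd` prints lines of the form `name => path (address)`.

```shell
# Read `ldd` output on stdin and print the path each libhwloc entry resolves to.
# Unresolved entries ("=> not found") are reported as "not found".
hwloc_paths() {
    awk '/libhwloc/ && $2 == "=>" {
        if ($3 == "not") print "not found"; else print $3
    }'
}

# Succeed only if every file given resolves libhwloc to one and the same path.
same_hwloc() {
    n=$(for f in "$@"; do ldd "$f" | hwloc_paths; done | sort -u | wc -l)
    [ "$n" -eq 1 ]
}

# Usage (the libpmix path is a placeholder, not a real location):
#   same_hwloc "$(command -v prte)" /path/to/libpmix.so || echo "hwloc mismatch"
```

Run it from the same shell environment used to launch prte, because `ldd` resolves libraries using the current library path, which is exactly where a second hwloc could be picked up first.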
@rhc54 a different problem now :( but will open new ticket for that
from prrte.
I'm closing this for now - we can look at the hwloc lib confusion separately.
from prrte.