Comments (17)

rhc54 commented on June 2, 2024

It's working fine on the systems I can access, so it must be something either in your system or perhaps a stale plugin. Try adding "-pmca state_base_verbose 5" on your cmd line. Also, do you have a hostfile somewhere, or are you running in an allocation? Or is this just a one-node test?

from prrte.

tonycurtis commented on June 2, 2024

This is just a vanilla "prte" run on the login node. Debug output with added flag:

prte -d --pmca ess_base_verbose 1000  -pmca state_base_verbose 5
[login:23923] mca: base: components_register: registering framework ess components
[login:23923] mca: base: components_register: found loaded component tm
[login:23923] mca: base: components_register: component tm has no register or open function
[login:23923] mca: base: components_register: found loaded component env
[login:23923] mca: base: components_register: component env has no register or open function
[login:23923] mca: base: components_register: found loaded component hnp
[login:23923] mca: base: components_register: component hnp has no register or open function
[login:23923] mca: base: components_register: found loaded component slurm
[login:23923] mca: base: components_register: component slurm has no register or open function
[login:23923] mca: base: components_open: opening ess components
[login:23923] mca: base: components_open: found loaded component tm
[login:23923] mca: base: components_open: component tm open function successful
[login:23923] mca: base: components_open: found loaded component env
[login:23923] mca: base: components_open: component env open function successful
[login:23923] mca: base: components_open: found loaded component hnp
[login:23923] mca: base: components_open: component hnp open function successful
[login:23923] mca: base: components_open: found loaded component slurm
[login:23923] mca: base: components_open: component slurm open function successful
[login:23923] mca:base:select: Auto-selecting ess components
[login:23923] mca:base:select:(  ess) Querying component [tm]
[login:23923] mca:base:select:(  ess) Querying component [env]
[login:23923] mca:base:select:(  ess) Querying component [hnp]
[login:23923] mca:base:select:(  ess) Query of component [hnp] set priority to 100
[login:23923] mca:base:select:(  ess) Querying component [slurm]
[login:23923] mca:base:select:(  ess) Selected component [hnp]
[login:23923] mca: base: close: component tm closed
[login:23923] mca: base: close: unloading component tm
[login:23923] mca: base: close: component env closed
[login:23923] mca: base: close: unloading component env
[login:23923] mca: base: close: component slurm closed
[login:23923] mca: base: close: unloading component slurm
[login:23923] procdir: /tmp/ompi.login.170008941/dvm/0/0
[login:23923] jobdir: /tmp/ompi.login.170008941/dvm/0
[login:23923] top: /tmp/ompi.login.170008941/dvm
[login:23923] top: /tmp/ompi.login.170008941
[login:23923] tmp: /tmp
[login:23923] sess_dir_cleanup: job session dir does not exist
[login:23923] sess_dir_cleanup: top session dir does not exist
[login:23923] procdir: /tmp/ompi.login.170008941/dvm/0/0
[login:23923] jobdir: /tmp/ompi.login.170008941/dvm/0
[login:23923] top: /tmp/ompi.login.170008941/dvm
[login:23923] top: /tmp/ompi.login.170008941
[login:23923] tmp: /tmp
[login:23923] sess_dir_finalize: proc session dir does not exist
[login:23923] sess_dir_finalize: job session dir does not exist
[login:23923] sess_dir_finalize: jobfam session dir not empty - leaving
[login:23923] sess_dir_finalize: jobfam session dir not empty - leaving
[login:23923] sess_dir_finalize: top session dir not empty - leaving
[login:23923] sess_dir_cleanup: job session dir does not exist
[login:23923] sess_dir_cleanup: found top session dir empty - deleting

Here's the configure stanza:

../psrvr-git/configure \
    --enable-debug \
    --with-tm \
    --prefix=/.../psrvr/git \
    --with-libevent=/.../libevent \
    --with-hwloc=/.../hwloc/1.11.9 \
    --with-pmix=/.../pmix/git

Self-built libevent, hwloc, pmix.

rhc54 commented on June 2, 2024

Hmmm...that is really odd. I honestly cannot duplicate it. Even using the same cmd line, it works just fine. What's disturbing here is that you aren't getting the right debug output from the ess/hnp component - you should get the topology output along with a bunch of stuff.

Try adding "-pmca orte_report_silent_errors 1" to the cmd line - let's see if you are hitting an error path that stays silent because it assumes the error was already reported.

tonycurtis commented on June 2, 2024

Yeah, it's mind-boggling. I was wondering about how to enable silent errors, so thanks for the enlightenment. I can't duplicate it anywhere else, either. I tried rm-rf'ing and reinstalling pmix, hwloc and prrte itself (to make sure I'm not picking up stale plugins).

prte -d --pmca ess_base_verbose 1000 -pmca orte_report_silent_errors 1
[cn-mem:53795] mca: base: components_register: registering framework ess components
[cn-mem:53795] mca: base: components_register: found loaded component tm
[cn-mem:53795] mca: base: components_register: component tm has no register or open function
[cn-mem:53795] mca: base: components_register: found loaded component env
[cn-mem:53795] mca: base: components_register: component env has no register or open function
[cn-mem:53795] mca: base: components_register: found loaded component hnp
[cn-mem:53795] mca: base: components_register: component hnp has no register or open function
[cn-mem:53795] mca: base: components_register: found loaded component slurm
[cn-mem:53795] mca: base: components_register: component slurm has no register or open function
[cn-mem:53795] mca: base: components_open: opening ess components
[cn-mem:53795] mca: base: components_open: found loaded component tm
[cn-mem:53795] mca: base: components_open: component tm open function successful
[cn-mem:53795] mca: base: components_open: found loaded component env
[cn-mem:53795] mca: base: components_open: component env open function successful
[cn-mem:53795] mca: base: components_open: found loaded component hnp
[cn-mem:53795] mca: base: components_open: component hnp open function successful
[cn-mem:53795] mca: base: components_open: found loaded component slurm
[cn-mem:53795] mca: base: components_open: component slurm open function successful
[cn-mem:53795] mca:base:select: Auto-selecting ess components
[cn-mem:53795] mca:base:select:(  ess) Querying component [tm]
[cn-mem:53795] mca:base:select:(  ess) Querying component [env]
[cn-mem:53795] mca:base:select:(  ess) Querying component [hnp]
[cn-mem:53795] mca:base:select:(  ess) Query of component [hnp] set priority to 100
[cn-mem:53795] mca:base:select:(  ess) Querying component [slurm]
[cn-mem:53795] mca:base:select:(  ess) Selected component [hnp]
[cn-mem:53795] mca: base: close: component tm closed
[cn-mem:53795] mca: base: close: unloading component tm
[cn-mem:53795] mca: base: close: component env closed
[cn-mem:53795] mca: base: close: unloading component env
[cn-mem:53795] mca: base: close: component slurm closed
[cn-mem:53795] mca: base: close: unloading component slurm
[cn-mem:53795] procdir: /tmp/ompi.cn-mem.170008941/dvm/0/0
[cn-mem:53795] jobdir: /tmp/ompi.cn-mem.170008941/dvm/0
[cn-mem:53795] top: /tmp/ompi.cn-mem.170008941/dvm
[cn-mem:53795] top: /tmp/ompi.cn-mem.170008941
[cn-mem:53795] tmp: /tmp
[cn-mem:53795] sess_dir_cleanup: job session dir does not exist
[cn-mem:53795] sess_dir_cleanup: top session dir does not exist
[cn-mem:53795] procdir: /tmp/ompi.cn-mem.170008941/dvm/0/0
[cn-mem:53795] jobdir: /tmp/ompi.cn-mem.170008941/dvm/0
[cn-mem:53795] top: /tmp/ompi.cn-mem.170008941/dvm
[cn-mem:53795] top: /tmp/ompi.cn-mem.170008941
[cn-mem:53795] tmp: /tmp
[cn-mem:53795] sess_dir_finalize: proc session dir does not exist
[cn-mem:53795] sess_dir_finalize: job session dir does not exist
[cn-mem:53795] sess_dir_finalize: jobfam session dir not empty - leaving
[cn-mem:53795] sess_dir_finalize: jobfam session dir not empty - leaving
[cn-mem:53795] sess_dir_finalize: top session dir not empty - leaving
[cn-mem:53795] sess_dir_cleanup: job session dir does not exist
[cn-mem:53795] sess_dir_cleanup: found top session dir empty - deleting

rhc54 commented on June 2, 2024

Afraid I am at a loss - all I can suggest is go into orte/mca/ess/hnp/ess_hnp_module.c and add print statements (or use your favorite debugger) to see where it decides to jump to exit.

tonycurtis commented on June 2, 2024

I'd descended to ess_hnp_module.c but wanted to ask in case those-in-the-know could spare me a wild goose chase.

thanks

tonycurtis commented on June 2, 2024

Looks like the failure is in pmix_server_init, specifically in PMIx_server_init.

Seems like it is pmix_hwloc_get_topology that is falling over, which makes sense, given the expected output.

rhc54 commented on June 2, 2024

I think this might be a case of differences in hwloc versions - I've been building against v2.0.1. I'll build against v1.11.7 and see if I can then replicate.

tonycurtis commented on June 2, 2024

I've got 1.11.9 and 2.0.1 in my home directory, plus 1.11.3 cluster-wide. I have 1.11.9 installed via yum/dnf on the boxes where it works.

rhc54 commented on June 2, 2024

One thing I have found is that our hwloc configury can get confused if there is a version in standard locations, and you also have another version in your path. You might check to ensure that isn't the case. Meantime, I'll test here.

rhc54 commented on June 2, 2024

Okay, that was quick - got a segfault when running against 1.11. I'll try to chase this down. In the interim, I would recommend using 2.0.1 as that is known to work.

rhc54 commented on June 2, 2024

@bgoglin I'm finding that the topology contains several garbage fields when loading it via v1.11 instead of v2.0.1. It looks to me like the hwloc_topology_t structure has been initialized, but isn't actually being loaded. The online cpuset is NULL, the arity value is at some max value, etc. Here is the root object:

{type = HWLOC_OBJ_SYSTEM, os_index = 0, name = 0x0, memory = {total_memory = 0, local_memory = 0, page_types_len = 4212383744, page_types = 0x1ea8f70}, attr = 0x0, depth = 0, 
  logical_index = 0, os_level = 0, next_cousin = 0x0, prev_cousin = 0x0, parent = 0x0, sibling_rank = 0, next_sibling = 0x2, prev_sibling = 0x1eaf3a0, arity = 32179136, 
  children = 0x1eb5430, first_child = 0x1, last_child = 0x0, userdata = 0x1ec4dd0, cpuset = 0x0, complete_cpuset = 0x0, online_cpuset = 0x0, allowed_cpuset = 0x1ea9490, 
  nodeset = 0x1ea9500, complete_nodeset = 0x1ea9570, allowed_nodeset = 0x1ea95e0, distances = 0x1ebd6e0, distances_count = 24, infos = 0x0, infos_count = 1, symmetric_subtree = 0}

Do you have any ideas why it would be this messed up? I'm wondering if either the hwloc libraries and headers are getting mixed up, or if we somehow broke the v1.11 integration when we added the v2.0.1 support.

bgoglin commented on June 2, 2024

@rhc54 It looks like the topology was loaded with v2 while you are reading it with v1.11, which would explain several of the garbage fields above. I'll look later today.
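
To illustrate the kind of mismatch described here, the following is a minimal C sketch. The two struct layouts (`obj_v1`, `obj_v2`) and both helpers are hypothetical and are not the real hwloc object definitions; they only show how a structure filled in under one layout yields garbage when read through another.

```c
#include <string.h>

/* Hypothetical "old" layout. NOT the real hwloc 1.11 object. */
struct obj_v1 {
    int type;
    unsigned os_index;
    unsigned arity;
    void *cpuset;
};

/* Hypothetical "new" layout: one field inserted near the top,
 * which shifts every member that follows it. */
struct obj_v2 {
    int type;
    unsigned long long total_memory;
    unsigned os_index;
    unsigned arity;
    void *cpuset;
};

/* Fill an object the way a library built with the new layout would. */
static void fill_v2(struct obj_v2 *o)
{
    memset(o, 0, sizeof *o);
    o->type = 1;
    o->total_memory = 0x0123456789ABCDEFULL;
    o->os_index = 0;
    o->arity = 2;
    o->cpuset = NULL;
}

/* Read the same bytes through the old layout, as a caller compiled
 * against the old header would. memcpy avoids aliasing problems. */
static struct obj_v1 read_as_v1(const struct obj_v2 *src)
{
    struct obj_v1 out;
    memcpy(&out, src, sizeof out);
    return out;
}
```

With these made-up layouts, `read_as_v1` still returns the right `type` (it sits at the same offset in both structs), but on common ABIs `arity` comes back as a slice of `total_memory` rather than 2, which is the same flavor of nonsense as the arity and pointer values in the dump above.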

tonycurtis commented on June 2, 2024

Ah...PMIx wasn't finding hwloc: pointed PMIx and PRRTE at the same hwloc, and all is well.

thanks!

rhc54 commented on June 2, 2024

@tonycurtis glad to hear you are up and running! Thx for the patience.

@bgoglin It sounds like you are thinking we have a 1.11 header, but are using the v2 library? I can explore that some more - I think this may go back to our long-standing configury problem of picking up things in common directories instead of a specific one for just that one package.

In other words: I have libevent installed in the same location as hwloc v2.0, but I was pointing --with-hwloc at the 1.11 location. When I check with ldd things are linked correctly. However, my library path may wind up catching the v2 library first. I'll try to do some checking today.
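
One way to catch a header/library mix-up like this at startup is to compare the major API version compiled in from the header with the one reported by the library that was actually loaded. hwloc exposes these as the `HWLOC_API_VERSION` macro and `hwloc_get_api_version()`, with the major version in the upper 16 bits. The sketch below keeps the comparison in standalone helpers so that it compiles without hwloc; `api_major_matches` and `check_versions` are hypothetical names, and in a real program the two arguments would be `HWLOC_API_VERSION` and `hwloc_get_api_version()`.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 when both version words carry the same
 * major number in their upper 16 bits, 0 otherwise. */
static int api_major_matches(unsigned header_version, unsigned runtime_version)
{
    return (header_version >> 16) == (runtime_version >> 16);
}

/* Hypothetical helper: report a mismatch on stderr and return 0,
 * or return 1 when the major versions agree.
 * Assumed real-world call:
 *   check_versions(HWLOC_API_VERSION, hwloc_get_api_version()); */
static int check_versions(unsigned header_version, unsigned runtime_version)
{
    if (!api_major_matches(header_version, runtime_version)) {
        fprintf(stderr,
                "hwloc mismatch: built against API 0x%x, running with 0x%x\n",
                header_version, runtime_version);
        return 0;
    }
    return 1;
}
```

A check like this only compares major versions, so it would flag a 1.x header paired with a 2.x library but says nothing about which directory the library was picked up from; `ldd` and the library path still need inspecting for that.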

tonycurtis commented on June 2, 2024

@rhc54 a different problem now :( but will open new ticket for that

rhc54 commented on June 2, 2024

I'm closing this for now - we can look at the hwloc lib confusion separately.
