Comments (5)
@rhc54 sure, I'll take a crack at it!
How can you reproduce this issue?
Could a simple non-PMIx program that writes a lot to stdout do the trick?
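For reference, a "simple non-PMIx program that writes a lot to stdout" could look something like the sketch below. The function name `flood` and its default sizes are assumptions chosen for illustration; nothing in this thread specifies how much output is needed to trigger the problem.

```python
import sys


def flood(stream, lines=100_000, width=80):
    """Write `lines` lines of `width` characters to `stream`.

    Hypothetical stress helper: it makes no PMIx calls and only
    produces a large volume of output. Returns the number of
    characters written.
    """
    line = "x" * width + "\n"
    for _ in range(lines):
        stream.write(line)
    stream.flush()
    return lines * len(line)
```

Calling `flood(sys.stdout)` from a script and launching that script as the application would generate the output volume; the line count can be raised if the default turns out to be too small.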
from prrte.
You can find my "cycle.sh" here: https://github.com/pmix/pmix-tests/blob/master/prrte/cycle.sh
I think I've thought through the issue, though I haven't attempted a fix. I believe the problem is a race condition between PRRTE sending a job-terminated notice to prun and the IOF sending the output to prun. If the job-terminated notification gets delivered first, then prun will exit - yet the IOF doesn't know that has happened and still attempts to write the app's output down the (now defunct) socket.
The issue is that prun is not a child of PRRTE - it's just a connected tool. We need to ensure that any IOF collected from the associated app gets sent to prun before we send the termination notification. We then need to ensure that prun outputs any received IO prior to actually terminating - guess we need an "IOF flush" somewhere in prun's finalization routine.
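The ordering concern described above can be modelled with a toy simulation. This is only a sketch of the hypothesized race, not PRRTE or PMIx code: `ToolConnection`, `finish_job`, and the `flush_first` flag are hypothetical names invented for the illustration.

```python
from collections import deque


class ToolConnection:
    """Stand-in for the prun end of the socket (hypothetical, not a PMIx API)."""

    def __init__(self):
        self.open = True
        self.received = []  # output the tool actually printed
        self.dropped = []   # messages written after the tool had exited

    def deliver(self, kind, payload):
        if not self.open:
            # The tool is gone; the write goes to a defunct socket.
            self.dropped.append((kind, payload))
            return
        if kind == "output":
            self.received.append(payload)
        elif kind == "terminated":
            # The tool exits as soon as it sees the termination notice.
            self.open = False


def finish_job(tool, pending_output, flush_first):
    """Send the termination notice, optionally flushing queued output first."""
    pending = deque(pending_output)
    if flush_first:
        while pending:
            tool.deliver("output", pending.popleft())
    tool.deliver("terminated", None)
    # Anything still queued now arrives after the tool has exited.
    while pending:
        tool.deliver("output", pending.popleft())
    return tool
```

With `flush_first=False` every queued line ends up in `dropped`; with `flush_first=True` the tool receives all of its output before it sees the termination notice.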
HTH
@ggouaillardet I spent the day working on this and found that my above premise was incorrect - this had nothing really to do with the IOF. The problem appears to be in either the notification or the messaging system of the client.
What I found was that we were caching notifications for a long period of time, as you previously noted. However, since PRRTE wasn't providing set targets for those job-termination notices, every prun would get a flood of notices about prior jobs when it connected to the server. The client event system would filter those so that prun itself never saw them, but once you built up a lot of them, prun would start to hang under the flood.
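To illustrate why untargeted notices turn into a flood, here is a minimal model of a server replaying cached notices to a newly connected client. The `Notice` type, its `target` field, and `replay_on_connect` are assumptions made for this sketch and do not correspond to actual PMIx or PRRTE data structures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Notice:
    job_id: str
    # None models a notice with no set target: every client receives it.
    target: Optional[str] = None


def replay_on_connect(cache: List[Notice], client_id: str) -> List[Notice]:
    """Return the cached notices that would be sent to a connecting client."""
    return [n for n in cache if n.target is None or n.target == client_id]


# 1000 earlier jobs whose termination notices were cached without a target.
untargeted = [Notice(job_id=f"job-{i}") for i in range(1000)]
# The same notices, each addressed to the tool that launched that job.
targeted = [Notice(job_id=f"job-{i}", target=f"prun-{i}") for i in range(1000)]

flood_size = len(replay_on_connect(untargeted, "prun-new"))
quiet_size = len(replay_on_connect(targeted, "prun-new"))
```

In this model a new client connecting after 1000 untargeted jobs is sent all 1000 stale notices, which it must then filter itself, while with targets set it is sent none.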
Adding the IOF into the mix seemed to make it happen faster, though I couldn't solidly confirm that impression. I'm inclined to believe it is solidly in the event system because I was able to fix the problem by revamping solely the event system. However, the revamp also eliminated the message flood, and so I might have solved only the symptom. We need to stress the server-to-client message system to be sure.
My fixes are in two PRs, one for PMIx itself and the other for PRRTE:
openpmix/openpmix#1003
#175
With those fixes, I am able to run my cycle.sh test essentially forever (well, I tested 10k iterations without a problem, so that's forever to me!).
Would you have a chance to check out the messaging system? I'd like to ensure my fixes aren't just masking the root cause of the problem.
Looks like this is fixed