Comments (5)

ggouaillardet commented on June 9, 2024

@rhc54 sure, I'll take a crack at it!

How can I reproduce this issue?
Would a simple non-PMIx program that writes a lot to stdout do the trick?

from prrte.

rhc54 commented on June 9, 2024

You can find my "cycle.sh" here: https://github.com/pmix/pmix-tests/blob/master/prrte/cycle.sh

rhc54 commented on June 9, 2024

I think I've thought through the issue, though I haven't attempted a fix. I believe the problem is a race condition between PRRTE sending a job-terminated notice to prun and the IOF sending the output to prun. If the job-terminated notification gets delivered first, then prun will exit - yet the IOF doesn't know that has happened and still attempts to write the app's output down the (now defunct) socket.

The issue is that prun is not a child of PRRTE - it's just a connected tool. We need to ensure that any IOF collected from the associated app gets sent to prun before we send the termination notification. We then need to ensure that prun outputs any received IO prior to actually terminating - guess we need an "IOF flush" somewhere in prun's finalization routine.
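As a toy illustration of the ordering described above (this is not PRRTE code; `ToolConnection` and `finish_job` are hypothetical names, and the "socket" is just an in-memory object), the sketch below models a tool that exits as soon as it sees the termination notice. Sending pending output first delivers everything; sending the notice first makes the later write fail.

```python
from collections import deque


class ToolConnection:
    """Hypothetical stand-in for the socket between the server and a tool like prun."""

    def __init__(self):
        self.open = True
        self.received = []

    def deliver(self, kind, payload=None):
        if not self.open:
            # Models writing down a socket whose peer has already exited.
            raise BrokenPipeError("tool has already exited")
        if kind == "io":
            self.received.append(payload)
        elif kind == "terminated":
            # Assumption: the tool exits as soon as it sees this notice.
            self.open = False


def finish_job(conn, pending_io, flush_first):
    """Send buffered output and the termination notice in the chosen order."""
    pending = deque(pending_io)
    if flush_first:
        # Drain all collected output before announcing termination.
        while pending:
            conn.deliver("io", pending.popleft())
        conn.deliver("terminated")
    else:
        # Racy order: the notice wins, and later writes hit a closed connection.
        conn.deliver("terminated")
        while pending:
            conn.deliver("io", pending.popleft())
    return conn.received
```

Calling `finish_job(ToolConnection(), ["line 1", "line 2"], flush_first=True)` returns both lines, while the same call with `flush_first=False` raises `BrokenPipeError`.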

HTH

rhc54 commented on June 9, 2024

@ggouaillardet I spent the day working on this and found that my above premise was incorrect - this had nothing really to do with the IOF. The problem appears to be in either the notification or the messaging system of the client.

What I found was that we were caching notifications for a long period of time, as you previously noted. However, since PRRTE wasn't providing set targets for those job-termination notices, every prun would get a flood of notices about prior jobs when it connected to the server. The client event system would filter those so that prun itself never saw them, but once you built up a lot of them, prun would start to hang under the flood.
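To make the flood concrete, here is a small model of the behavior described above. It is only a sketch under assumptions: `NotificationCache`, `notices_on_connect`, and the `prun-N` client names are hypothetical and do not correspond to PMIx or PRRTE APIs. It compares how many cached notices a newly connected tool would be sent when notices carry no target versus when each notice is targeted at the tool that launched the job.

```python
class NotificationCache:
    """Hypothetical server-side cache of job-termination notices."""

    def __init__(self):
        # Each entry is (target, message); target None means "no explicit target".
        self._cached = []

    def add(self, message, target=None):
        self._cached.append((target, message))

    def replay_for(self, client_id):
        """Return the cached notices a newly connected client would be sent."""
        return [msg for target, msg in self._cached
                if target is None or target == client_id]


def notices_on_connect(num_prior_jobs, targeted):
    """Count what the next tool receives after num_prior_jobs have terminated."""
    cache = NotificationCache()
    for job in range(num_prior_jobs):
        # With targeting, each notice is addressed to the tool that ran the job.
        target = f"prun-{job}" if targeted else None
        cache.add(f"job {job} terminated", target)
    # A new tool connects after all prior jobs have finished.
    return cache.replay_for(f"prun-{num_prior_jobs}")
```

In this model, the untargeted case replays one notice per prior job to every new tool, so the volume grows with the number of cycles, while the targeted case replays none.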

Adding the IOF into the mix seemed to make it happen faster, though I couldn't solidly confirm that impression. I'm inclined to believe it is solidly in the event system because I was able to fix the problem by revamping solely the event system. However, the revamp also eliminated the message flood, and so I might have solved only the symptom. We need to stress the server-to-client message system to be sure.

My fixes are in two PRs, one for PMIx itself and the other for PRRTE:
openpmix/openpmix#1003
#175

With those fixes, I am able to run my cycle.sh test essentially forever (well, I tested 10k iterations without a problem, so that's forever to me!).

Would you have a chance to check out the messaging system? I'd like to ensure my fixes aren't just masking the root cause of the problem.

rhc54 commented on June 9, 2024

Looks like this is fixed
