Comments (8)

uwiger commented on August 28, 2024

It's certainly not a feature! :)

I'm looking into it.

from gproc.

uwiger commented on August 28, 2024

I have some ideas, but the trickiest part of the problem is that a netsplit occurs. Only a few of the gen_leader versions (e.g. garret-smith/gen_leader_revival) have some support for netsplits, and at least when I try this scenario with garret-smith's version, it doesn't seem to do the right thing.

However, a few things come to mind:

  • Gen_leader is further delayed after the node ping times out, unless -kernel dist_auto_connect is set to once or never. The reason is that each message to the unresponsive node will lead to a connection attempt, which will then hang for a while.
  • Until we have solid netsplit handling in both gen_leader and gproc, I recommend that you set up your own higher-level supervision. One way to do this is to set -kernel dist_auto_connect once, as mentioned above, then have a process on each node that periodically sends a UDP message to the other known (but not necessarily connected) nodes. If you receive a UDP message from a node that's not in the nodes() list, you have a netsplit situation. If you have no better strategy available, you can then restart the nodes that make up one of the 'islands'.
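
The UDP-based check described in the second bullet could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of gproc or gen_leader: the module name split_detector, the port, the heartbeat interval and the {Node, Host} peer list are all hypothetical, and the node is assumed to be started with -kernel dist_auto_connect once.

```erlang
-module(split_detector).
-behaviour(gen_server).

-export([start_link/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(INTERVAL, 5000).  % heartbeat period in ms (arbitrary choice)

%% Peers is a list of {Node, Host} pairs for the other known nodes and
%% Port is the UDP port every node listens on. Both are hypothetical
%% configuration supplied by the caller.
start_link(Port, Peers) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, {Port, Peers}, []).

init({Port, Peers}) ->
    {ok, Sock} = gen_udp:open(Port, [binary, {active, true}]),
    erlang:send_after(?INTERVAL, self(), beat),
    {ok, #{sock => Sock, port => Port, peers => Peers}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Periodically send our node name to every known peer over UDP,
%% whether or not we are currently connected to it.
handle_info(beat, #{sock := Sock, port := Port, peers := Peers} = State) ->
    Msg = atom_to_binary(node(), utf8),
    lists:foreach(fun({_Node, Host}) ->
                          gen_udp:send(Sock, Host, Port, Msg)
                  end, Peers),
    erlang:send_after(?INTERVAL, self(), beat),
    {noreply, State};
%% A heartbeat from a known node that is not in nodes() means the node
%% is alive but not connected to us: a possible netsplit. Note that a
%% peer we have never connected to would be reported as well.
handle_info({udp, _Sock, _Ip, _InPort, Bin}, #{peers := Peers} = State) ->
    Senders = [N || {N, _Host} <- Peers, atom_to_binary(N, utf8) =:= Bin],
    Connected = nodes(),
    lists:foreach(fun(N) ->
                          case lists:member(N, Connected) of
                              true  -> ok;
                              false -> report_split(N)
                          end
                  end, Senders),
    {noreply, State};
handle_info(_Other, State) ->
    {noreply, State}.

%% Placeholder reaction: only logs. Restarting one of the 'islands',
%% as suggested above, would go here.
report_split(Node) ->
    error_logger:warning_msg("possible netsplit: ~p alive but not connected~n",
                             [Node]).
```

With these assumptions, each node would start the process with something like split_detector:start_link(4370, [{'b@host2', "host2"}]), where the port and peer list are made up for the example.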

norton commented on August 28, 2024

@msadkov There exists an application to detect network splits for mnesia and hibari. It would need some customisation for gproc. Nevertheless, it might be of help to you.

The application is here => https://github.com/hibari/partition-detector

The admin documentation is here => http://hibari.github.com/hibari-doc/hibari-sysadmin-guide.en.html#partition-detector

msadkov commented on August 28, 2024

@uwiger @norton thank you for your replies! I'm aware that gen_leader/gproc can't handle netsplits, so -kernel dist_auto_connect was set to once in this case (I should have mentioned this in my first post, sorry). With that said, this situation doesn't look like a netsplit, but rather like a node going down (not immediately, but after a timeout), right? And once the unresponsive node disappears from the nodes list (which means a terminated connection, AFAIU), there is no timeout involved anymore, since I can get an immediate DOWN message after calling erlang:monitor with a dead pid sitting in gproc's ets.
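
For reference, the immediate DOWN message can be reproduced locally with a small sketch like the one below. The module and function names are made up for the example, and it uses a local dead pid rather than one taken from gproc's ets tables.

```erlang
-module(monitor_demo).
-export([dead_pid_reason/0]).

%% Monitor a process that has already exited and return the reason
%% carried by the resulting 'DOWN' message.
dead_pid_reason() ->
    %% Spawn a short-lived process and wait until it is really gone.
    {Pid, MRef0} = spawn_monitor(fun() -> ok end),
    receive
        {'DOWN', MRef0, process, Pid, _} -> ok
    end,
    %% Monitoring the dead pid delivers 'DOWN' right away.
    MRef = erlang:monitor(process, Pid),
    receive
        {'DOWN', MRef, process, Pid, Reason} -> Reason
    after 1000 ->
        timeout
    end.
```

For a local dead pid the reason is noproc. For a pid on a node that is not connected, the documented reason is noconnection, which should be the case seen here once the node has dropped out of nodes().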

uwiger commented on August 28, 2024

I'm making some progress getting gproc to heal after netsplits, as well as doing proper monitoring. I don't have a solution for handling conflicts yet, and need to fix some regression bugs. I'll keep you informed.

msadkov commented on August 28, 2024

Thank you!

uwiger commented on August 28, 2024

Going through the issues list, closing out issues. This one is not yet resolved, so I'll leave it open. Sorry about the delay.

uwiger commented on August 28, 2024

Closing this issue. Feel free to try out the locks_leader branch, which should handle netsplits more robustly.
