
Comments (4)

ioguix commented on May 13, 2024

Hello,

Considering the code, I believe you meant: https://github.com/dalibo/PAF/blob/v1.0.0/script/pgsqlms#L940

Considering the OCF developer guide[1] about the stop action, you are right.

But we have to study this case in detail. At the very least, if we don't catch this failing status during the stop process, it MUST be caught somewhere else at some point. Depending on your Pacemaker setup, the CRM might try to recover your failing master (demote->stop->start->promote transition), or this failing instance might be started back as a slave later. This can lead to unpredictable situations.

If I remember correctly, we decided to raise a failure to set this question aside for the first version and to make sure a resource in an inconsistent state is kicked out of the cluster until a human fixes it. Maybe we are over-protective?
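
To make the trade-off concrete, here is a minimal sketch of the two possible behaviours of a stop action. It is not the actual pgsqlms code (which is written in Perl): the function name `stop_return_code` and the `strict` flag are hypothetical, and the sketch assumes that stopping a healthy instance always works. Only the numeric return codes come from the OCF resource agent API.

```python
# OCF return codes as defined by the OCF resource agent API.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7
OCF_RUNNING_MASTER = 8
OCF_FAILED_MASTER = 9


def stop_return_code(monitor_rc: int, strict: bool = True) -> int:
    """Hypothetical helper: map a pre-stop monitor result to a stop result.

    strict=True  models the over-protective choice discussed above:
                 a failed instance makes the stop action fail.
    strict=False models the lenient reading: a failed instance is
                 reported as stopped (assumption: it really is down).
    """
    if monitor_rc == OCF_NOT_RUNNING:
        # Already cleanly stopped: stop must succeed.
        return OCF_SUCCESS
    if monitor_rc in (OCF_SUCCESS, OCF_RUNNING_MASTER):
        # A real agent would shut the instance down here; this sketch
        # assumes that shutdown works.
        return OCF_SUCCESS
    # Any other status is treated as a failed or inconsistent instance.
    return OCF_ERR_GENERIC if strict else OCF_SUCCESS


if __name__ == "__main__":
    for rc in (OCF_NOT_RUNNING, OCF_RUNNING_MASTER, OCF_FAILED_MASTER):
        print(rc, stop_return_code(rc), stop_return_code(rc, strict=False))
```

With `strict=True` a failed stop is escalated by Pacemaker (fencing, if configured); with `strict=False` the agent must be certain the instance is really down before reporting success.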

Note that if you set up fencing, the failing node is supposed to be fenced (after failcount recover) and a slave is supposed to be promoted.

Unless I'm missing something really important here, I prefer to focus on documentation and v1.1 for now, but we will definitely keep this open until we can give it more time and testing.

But feel free to experiment and post feedback; contributions are very welcome, obviously! I'll make time to follow up on this issue when there are updates.

[1] http://www.linux-ha.org/doc/dev-guides/_literal_stop_literal_action.html

from paf.

ioguix commented on May 13, 2024

Hello,

I just pushed a commit that might help if the CRM tries to recover the resource on the same node. See 8eb2bab.

We still have to investigate other situations, though, where the CRM really wants to move the resource to another node and the resource is no longer able to start correctly.

Regards,


ioguix commented on May 13, 2024

Hi,

Considering this (and partly #9), I think using a watchdog (e.g. softdog, which is available in the Linux kernel) might help you in this scenario. This is not remote fencing, but at least the node can commit suicide and the transition can keep going.

However, I'm not sure yet if this is supported under Pacemaker/RHEL 6.


ioguix commented on May 13, 2024

Hello,

I believe the solution is to set up a watchdog in your cluster so that the node where your resource cannot stop can fence itself.

I wrote some documentation about how to set up a watchdog under CentOS 7. It shouldn't be hard to adapt to EL6 if needed. See: http://dalibo.github.io/PAF/CentOS-7-admin-cookbook.html#setting-up-a-watchdog

Regards,

