I am in a test environment now, and when I shutdown my node currently running pgsqld a

2 node cluster, when master node is shutdown, promotion of pgsqld on slave is aborted (paf issue, closed)

themactech commented on May 23, 2024

2 node cluster, when master node is shutdown, promotion of pgsqld on slave is aborted

from paf.

Comments (21)

themactech commented on May 23, 2024

Sorry for not getting back sooner, but I was given other priorities. This project will be deployed without HA, and a manual failover process (which is currently working) will do. Thank you for your input and your very generous offer of time/help.


ioguix commented on May 23, 2024

Hi,

The promotion scores are supplied by the primary node. So if you shut down the standby and then the primary before the primary has noticed that the standby went away, the score will sit there forever.

You can set the score by hand using crm_attribute --promotion or crm_master (deprecated) to trigger some changes if needed.
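
As a rough sketch of what setting the score by hand could look like: the resource name pgsqld and the node name cdrdbslave are taken from this thread, while the score value 1001 and the helper name set_promotion_score are illustrative assumptions. The script only prints the command by default, and the exact options should be checked against crm_attribute --help on the installed Pacemaker version.

```shell
#!/bin/sh
# Sketch: build the command that sets a promotion score by hand.
# RUN defaults to "echo", so the command is printed, not executed.
# Run with RUN= (empty) to execute it on a real cluster node.
RUN="${RUN-echo}"

# set_promotion_score RESOURCE NODE SCORE (hypothetical helper name)
set_promotion_score() {
    resource="$1"
    node="$2"
    score="$3"
    # --promotion=RESOURCE selects the promotion score attribute of RESOURCE;
    # --lifetime reboot keeps the value transient (lost when the node restarts).
    $RUN crm_attribute --promotion="$resource" --node "$node" \
        --lifetime reboot --update "$score"
}

# Illustrative values only: 1001 is an arbitrary example score.
set_promotion_score pgsqld cdrdbslave 1001
```

With the default RUN=echo this prints the crm_attribute command line so it can be reviewed before being run for real.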

Regards,


themactech commented on May 23, 2024

That makes little sense to me, if an abrupt shutdown breaks the cluster, this means it will not survive a catastrophic hardware failure that immediately brings the master node down... That seems to defeat the purpose of an HA cluster. I am rebuilding my setup (currently working on VMs for testing) and will confirm whether the failover promotion works when the master node is put on standby. If it does, I will need to investigate whether there are options to fail over gracefully even when the master node goes offline abruptly.


themactech commented on May 23, 2024

If the promotion score is the issue, it should not be in play in a 2-node cluster, or in a multi-node cluster with only one standby PostgreSQL node. Could that not be verified first, to determine whether a promotion score query is even needed? The corosync configuration has a special entry for a 2-node cluster (so quorum still works); does the PAF agent check for this?
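
For reference, the special 2-node entry is the two_node: 1 setting in the quorum section of corosync.conf. The sketch below only shows one way to check a configuration file for it; the helper name has_two_node and the demo file are assumptions for illustration, and it says nothing about what the PAF agent itself checks.

```shell
#!/bin/sh
# Sketch: report whether a corosync.conf enables the two-node quorum mode.
# has_two_node is a hypothetical helper; the default path is the usual location.
has_two_node() {
    conf="${1:-/etc/corosync/corosync.conf}"
    # Match an uncommented "two_node: 1" line.
    grep -Eq '^[[:space:]]*two_node:[[:space:]]*1[[:space:]]*$' "$conf"
}

# Demonstration on a throwaway file, so the sketch runs without a cluster.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
quorum {
    provider: corosync_votequorum
    two_node: 1
}
EOF

if has_two_node "$demo"; then
    echo "two_node enabled"
else
    echo "two_node not enabled"
fi
rm -f "$demo"
```

On a real node, calling has_two_node without an argument checks /etc/corosync/corosync.conf.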


themactech commented on May 23, 2024

Here are the results of my testing:

When initially configured, my cluster has the following two nodes:
- cdrdbmaster
- cdrdbslave

cdrdbmaster is the DC, runs the VirtualIP, and is also master for pgsqld-clone.

If I put the cdrdbmaster on standby with:

pcs node standby cdrdbmaster

then cdrdbslave becomes DC, properly takes over the VirtualIP, and gets pgsqld promoted to master.

If I then put the cdrdbmaster back online with:

pcs node unstandby cdrdbmaster

All services (and DC) stay on cdrdbslave (which is fine)

But then if I put the cdrdbslave on standby with:

pcs node standby cdrdbslave

The DC role is properly taken over by cdrdbmaster (so the cluster seems to operate fine), BUT pgsqld-clone is never promoted on cdrdbmaster. It stays in slave mode and, because of the colocation constraint, the VirtualIP resource is stopped.
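
The sequence above can be replayed with a small script. This is only a sketch: the node names come from this comment, the function name replay_failover_test is made up for illustration, and by default the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch: replay the standby/unstandby sequence described in this comment.
# RUN defaults to "echo", so the commands are only printed.
# Run with RUN= (empty) to execute them on a cluster node.
RUN="${RUN-echo}"
PRIMARY="cdrdbmaster"   # node names taken from this comment
STANDBY="cdrdbslave"

# replay_failover_test is a hypothetical helper name.
replay_failover_test() {
    # Step 1: services and the promoted pgsqld are expected to move to $STANDBY.
    $RUN pcs node standby "$PRIMARY"
    # Step 2: bring the node back; services are expected to stay on $STANDBY.
    $RUN pcs node unstandby "$PRIMARY"
    # Step 3: the reported problem, pgsqld is not promoted on $PRIMARY.
    $RUN pcs node standby "$STANDBY"
}

replay_failover_test
```

When run for real, the cluster needs time to settle between steps, so checking pcs status after each command is advisable.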

When looking at the logs, I see the call to promote the pgsqld service on cdrdbslave (during my first standby test) and that it was successful, but I see no trace of a second promote command when putting cdrdbslave on standby (when it was DC). This is my pcs status at this point:


ioguix commented on May 23, 2024

Hi,

Could you share the various scores on your cluster? The following commands should help:

crm_mon -frAno
crm_simulate -sL
pcs constraint list and pcs constraint show
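
To gather all of this in one go, something like the following sketch could be used. The helper names run_and_label and collect_scores are assumptions, and the -1 option passed to crm_mon is an addition so that it prints one snapshot and exits instead of staying in interactive mode.

```shell
#!/bin/sh
# Sketch: collect the requested score information as one labelled report.

# run_and_label prints a header line, then the command's output,
# or a note when the tool is not installed on this machine.
run_and_label() {
    echo "### $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "(command not found: $1)"
    fi
}

# collect_scores runs the commands listed above, one after the other.
collect_scores() {
    run_and_label crm_mon -1 -frAno   # -1 added: one-shot output, then exit
    run_and_label crm_simulate -sL
    run_and_label pcs constraint list
    run_and_label pcs constraint show
}

collect_scores
```

Redirecting the output to a file (for example collect_scores > cluster-scores.txt) gives a single attachment to post here.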


ioguix commented on May 23, 2024

> That makes little sense to me, if an abrupt shutdown breaks the cluster, this means it will not survive a catastrophic hardware failure that immediately brings the master node down...

It does.

You just created a double-failure scenario, moreover with some obscure steps I'm not sure I understand yet. So it's hard to explain why the cluster is acting like that.

BTW, how do you shut down your nodes? I'm confused about this:

> And even after a few minutes of the cdrdbslave node being completely shutdown, the pgsqld resource was not promoted ever on the cdrdbmaster node.

How could a node not be completely shut down? What are you doing exactly? Maybe corosync is still up and running?


themactech commented on May 23, 2024

I shut down the node from the terminal with:
"sudo shutdown -h now"
Following tests (if that had been successful), would have been to kill processes.
I do not have access to the VM to get you the info requested earlier, but I will obtain it for you as soon as I can.


themactech commented on May 23, 2024

I should note that this shutdown test worked for all the other Pacemaker clusters I have set up. The slave node was properly promoted. Starting the old master back up and then shutting down the newly promoted slave returned the original master to master.

I have done this with DRBD, UrBackup Docker images, and a few other services.


ioguix commented on May 23, 2024

Hi,

> Following tests (if that had been successful), would have been to kill processes

Go ahead, kill processes...

I would really like to be able to reproduce this. So please provide a detailed procedure and the versions of Pacemaker, PAF, PostgreSQL and the OS.

Thanks,


themactech commented on May 23, 2024

Sorry for the delay. I am currently writing a buildbook for this and also writing a script to automate the install, so I will be able to give you very detailed build instructions along with the version info you requested. However, I would not be able to post this in open forums. Could I get it to you by other means? Once/if we solve this issue, I would gladly do a write-up in the forums for everyone's benefit.


ioguix commented on May 23, 2024

Isn't it possible to give simple instructions about how to reproduce?
Note that there are already some (old) Vagrant scripts in the extra/ folder to provision a Pcmk/PAF cluster in a few minutes.
But anyway, can I reach you using your public email on your github account if you want to provide some private infos/script/buildbook?


themactech commented on May 23, 2024

Yes, you can reach me at [email protected].


ioguix commented on May 23, 2024

Well, my email is rejected by your provider for non-obvious reasons... sorry.


themactech commented on May 23, 2024

Could we connect via LinkedIn?


ioguix commented on May 23, 2024

Sorry, I still fail to understand why it's so complicated to describe step by step how to reproduce the issue...


themactech commented on May 23, 2024

Should I cover the entire PostgreSQL replication setup, or just start from the Pacemaker cluster setup? If you do not need my PostgreSQL replication setup, then it would indeed be much shorter, and I would just include it here.


themactech commented on May 23, 2024

I will have detailed steps and versions posted here by tomorrow.


ioguix commented on May 23, 2024

Unless your replication setup is really exotic, or if you want some advice about it, I don't need it...yet.

Thanks,


themactech commented on May 23, 2024

It is going to be late this weekend or Monday. I could not get that done today and will have limited access over the weekend.


ioguix commented on May 23, 2024

If your RTO/RPO allows manual failover, this is ALWAYS the best solution.

Regards,

