Comments (21)
Sorry for not getting back sooner, but I was given other priorities. This project will be deployed without HA, and a manual failover process (which is currently working) will do. Thank you for your input and your very generous offer of time/help.
from paf.
Hi,
The promotion scores are supplied by the primary node. So if you shut down the standby and the primary before the primary has noticed that the standby went away, the score will sit there forever.
You can set the score by hand using crm_attribute --promotion or crm_master (deprecated) to trigger some changes if needed.
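As a rough sketch of what setting or querying the score by hand could look like: the resource name pgsqld and node name cdrdbmaster are taken from this thread, while the value 1001 and the run helper are purely illustrative assumptions. By default the script only prints the commands; set DRY_RUN=0 on a test cluster to actually execute them.

```shell
#!/bin/sh
# Sketch only: wraps crm_attribute --promotion to set or query a promotion score.
# Assumptions: resource "pgsqld" and node "cdrdbmaster" come from the thread;
# the score 1001 is an arbitrary illustrative value, not a recommendation.

# Hypothetical helper: print the command in dry-run mode (default), else run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Set the promotion score of a resource on a node.
set_promotion_score() {
    run crm_attribute --promotion="$1" --node="$2" --update="$3"
}

# Read back the promotion score currently stored for a resource on a node.
query_promotion_score() {
    run crm_attribute --promotion="$1" --node="$2" --query
}

set_promotion_score pgsqld cdrdbmaster 1001
query_promotion_score pgsqld cdrdbmaster
```

Querying the score before changing it makes it easier to tell whether a stale value is what keeps the promotion from happening.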
Regards,
from paf.
That makes little sense to me, if an abrupt shutdown breaks the cluster, this means it will not survive a catastrophic hardware failure that immediately brings the master node down... That seems to defeat the purpose of an HA cluster. I am rebuilding my setup (currently on a VM for testing) and will confirm whether failover promotion works when the master node is put on standby. If it does, I will then need to investigate whether there are options to fail over gracefully even when the master node is abruptly taken offline.
from paf.
If the promotion score is the issue, it should not come into play in a 2-node cluster, or in a multi-node cluster that has only 1 standby PostgreSQL node. Could that not be verified first to determine whether a promotion score query is even needed? The corosync configuration has a special entry for a 2-node cluster (so quorum still works). Does the PAF agent check for this?
from paf.
Here are the results of my testing:
When initially configured, my cluster has the following two nodes:
- cdrdbmaster
- cdrdbslave
cdrdbmaster is the DC, runs the VirtualIP, and is also master for pgsqld-clone.
If I put the cdrdbmaster on standby with:
pcs node standby cdrdbmaster
then cdrdbslave becomes DC, properly takes over the VirtualIP, and gets pgsqld promoted to master.
If I then put the cdrdbmaster back online with:
pcs node unstandby cdrdbmaster
All services (and DC) stay on cdrdbslave (which is fine)
But then if I put the cdrdbslave on standby with:
pcs node standby cdrdbslave
The DC role is properly taken over by cdrdbmaster (so the cluster seems to operate fine), BUT pgsqld is never promoted on cdrdbmaster. It stays in slave mode, and because of the colocation constraint, the VirtualIP resource is stopped.
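For anyone wanting to replay the sequence above, it can be sketched as a small dry-run script. The node names are the ones from this thread; the run and standby_test helpers are hypothetical names introduced for illustration. The sketch does not wait for the cluster to settle between steps, so when running it for real (DRY_RUN=0) each step should be given time to complete.

```shell
#!/bin/sh
# Sketch only: replays the standby/unstandby sequence described above.
# Assumptions: node names from the thread; helpers are illustrative.

# Hypothetical helper: print the command in dry-run mode (default), else run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1 = node that starts as master, $2 = node that starts as standby.
standby_test() {
    first=$1
    second=$2
    run pcs node standby "$first"     # promotion expected on $second
    run pcs status
    run pcs node unstandby "$first"   # resources are expected to stay on $second
    run pcs status
    run pcs node standby "$second"    # step after which no promotion was observed
    run pcs status
}

standby_test cdrdbmaster cdrdbslave
```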
When looking at the logs, I see the call to promote pgsqld on cdrdbslave (from my first standby test) and that it succeeded, but I see no trace of a second promote command after putting cdrdbslave on standby (while it was DC). This is my pcs status at this point:
from paf.
Hi,
Could you share the various scores on your cluster? The following commands should help:
crm_mon -frAno
crm_simulate -sL
pcs constraint list
pcs constraint show
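To paste everything in one go, the outputs can be gathered with a small script along these lines. This is only a sketch: the commands are the ones listed above, the --one-shot flag on crm_mon is added here so it exits after printing once, and collect_scores is a hypothetical helper name. Commands that are missing or that fail are reported instead of aborting the run.

```shell
#!/bin/sh
# Sketch only: collect the requested diagnostics under one header per command.
# Assumption: --one-shot is added to crm_mon so it does not stay in monitor mode.

collect_scores() {
    for cmd in "crm_mon --one-shot -frAno" \
               "crm_simulate -sL" \
               "pcs constraint list" \
               "pcs constraint show"; do
        echo "### $cmd"
        # ${cmd%% *} is the first word, i.e. the program name.
        if command -v "${cmd%% *}" >/dev/null 2>&1; then
            # Word splitting of $cmd is intentional here.
            $cmd 2>&1 || echo "(command exited with status $?)"
        else
            echo "(${cmd%% *} not found on this host)"
        fi
        echo
    done
}

collect_scores
```

Redirecting the output to a file (for example `collect_scores > scores.txt`) gives a single report to attach.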
from paf.
That makes little sense to me, if an abrupt shutdown breaks the cluster, this means it will not survive a catastrophic hardware failure that immediately brings the master node down...
It does.
You just created a double-failure scenario, moreover with some obscure steps I'm not sure I understand yet. So it's hard to explain why the cluster is acting like that.
BTW, how do you shut down your nodes? I'm confused about this:
And even after a few minutes of the cdrdbslave node being completely shutdown, the pgsqld resource was not promoted ever on the cdrdbmaster node.
How could a node not be completely shut down? What are you doing exactly? Maybe corosync is still up and running?
from paf.
I shut down the node from the terminal with:
"sudo shutdown -h now"
Following tests (if that had been successful), would have been to kill processes
I do not have access to the VM right now to get you the info requested earlier, but I will obtain it for you as soon as I can.
from paf.
I should note that this shutdown test worked for all the other Pacemaker clusters I have set up. The slave node was properly promoted. Starting the old master back up and then shutting down the newly promoted slave returned the original master to master.
I have done this with DRBD, UrBackup docker images, and a few other services.
from paf.
Hi,
Following tests (if that had been successful), would have been to kill processes
Go ahead, kill processes...
I really would like to be able to reproduce this. So please provide a detailed procedure and the versions of Pacemaker, PAF, pgsql and the OS.
Thanks,
from paf.
Sorry for the delay. I am currently writing a buildbook for this and also writing a script to automate the install, so I will be able to give you very detailed build instructions along with the version info you requested. I would not, however, be able to post this in open forums. Could I get it to you by other means? Once/if we solve this issue, I would gladly do a write-up in the forums for everyone's benefit.
from paf.
Isn't it possible to give simple instructions on how to reproduce?
Note that there are already some (old) Vagrant scripts in the extra/ folder that provision a Pcmk/PAF cluster in a few minutes.
But anyway, can I reach you using the public email on your GitHub account if you want to provide some private info/script/buildbook?
from paf.
Yes, you can reach me at [email protected].
from paf.
Well, my email is rejected by your provider for non-obvious reasons... sorry.
from paf.
Could we connect via LinkedIn?
from paf.
Sorry, I still fail to understand why it's so complicated to describe step by step how to reproduce the issue...
from paf.
Should I cover the entire PostgreSQL replication setup, or just start from the Pacemaker cluster setup? If you do not need the PostgreSQL replication setup, then it would indeed be much shorter, and I would just include it here.
from paf.
I will have detailed steps and versions posted here by tomorrow.
from paf.
Unless your replication setup is really exotic, or if you want some advice about it, I don't need it...yet.
Thanks,
from paf.
It is going to be late this weekend or Monday. I could not get that done today and will have limited access over the weekend.
from paf.
If your RTO/RPO allows manual failover, this is ALWAYS the best solution.
Regards,
from paf.