
Comments (10)

ioguix commented on May 13, 2024

Hello,

Something seems definitely wrong with your setup.

The first log you show clearly comes from an old master, as it says it is ahead of the flush position. So I guess one slave was promoted when you switched off your master.

The second slave is not able to catch up with the new master because it cannot find a history path between timelines to the current master timeline: the slave was on timeline 9 and the new master is on timeline 13. It is probably missing a history file describing how to jump from one to the other.
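For reference, each timeline switch writes a small history file (timeline 13 is 0000000D.history, timeline 9 is 00000009.history) into pg_xlog and, if archiving is set up, into the WAL archive. A quick way to check what the slave actually has, assuming the Debian 9.4 paths shown later in this thread:

# On the slave, list the timeline history files present in pg_xlog
ls /var/lib/postgresql/9.4/main/pg_xlog/*.history
# The slave needs every .history file from its own timeline up to the
# master's; any missing file has to come from the WAL archive or a rebuild.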

As I explained in another issue, PAF sets a score of 0 on any slave not connected to the master. For now, I'm not sure we should fail them if they are not able to catch up with the master within a given period. A large slave might take several minutes to catch up with the master after being rebuilt. We must be able to detect a real failure, not just some lag, and that is harder than just a timeout, I guess.

But anyway, fix your setup first :)

ARogovsky commented on May 13, 2024

Hi!

I am just testing PAF for production. I was inserting some data on the master, so I can't switch off the master due to data loss. In production we can't stop the master.

This is a real failure, not just some lag: the slave requires WAL that is missing:
2016-04-18 10:44:36 MSK LOG: started streaming WAL from primary at 37B/78000000 on timeline 13
2016-04-18 10:44:36 MSK FATAL: could not receive data from WAL stream: ERROR: requested starting point 37B/78000000 is ahead of the WAL flush position of this server 37B/770AEBB8

You can verify the infinite loop with this command:

LC_ALL=C /usr/lib/postgresql/9.4/bin/pg_controldata /var/lib/postgresql/9.4/main

On the master I have these values:

pg_control version number:            942
Catalog version number:               201409291
Database system identifier:           6091923344877089459
Database cluster state:               in production
pg_control last modified:             Mon Apr 18 11:59:51 2016
Latest checkpoint location:           37B/770AF898
Prior checkpoint location:            37B/770AF7C0
Latest checkpoint's REDO location:    37B/770AF860
Latest checkpoint's REDO WAL file:    0000000D0000037B00000077
Latest checkpoint's TimeLineID:       13
Latest checkpoint's PrevTimeLineID:   13
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/101202261
Latest checkpoint's NextOID:          26734551
Latest checkpoint's NextMultiXactId:  1427989
Latest checkpoint's NextMultiOffset:  2859546
Latest checkpoint's oldestXID:        677
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  101202261
Latest checkpoint's oldestMultiXid:   1

On the slave:

pg_control version number:            942
Catalog version number:               201409291
Database system identifier:           6091923344877089459
Database cluster state:               in archive recovery
pg_control last modified:             Mon Apr 18 10:50:55 2016
Latest checkpoint location:           37B/78000028
Prior checkpoint location:            37B/78000028
Latest checkpoint's REDO location:    37B/78000028
Latest checkpoint's REDO WAL file:    0000000D0000037B00000078
Latest checkpoint's TimeLineID:       13
Latest checkpoint's PrevTimeLineID:   13
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/101202258
Latest checkpoint's NextOID:          26734551
Latest checkpoint's NextMultiXactId:  1427989
Latest checkpoint's NextMultiOffset:  2859546
Latest checkpoint's oldestXID:        677
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1

As you can see, the slave has a different checkpoint location than the master, and the master can't send the WAL.

Of course, this is a broken setup, but I was testing PAF for production, to be sure it works correctly in a broken setup too.

ioguix commented on May 13, 2024

Ok, you triggered this error on purpose.

Let's imagine PAF should detect such a situation and fail the slave by returning an error during the monitor action. That means:

1/ PAF must be aware that a master exists, and that we are not in a startup phase where every node is a slave
2/ PAF must be able to detect that the slave cannot catch up with the master

This second item is currently hard to check, as there are no statistics about the replication status available from the slaves. The only solution would be to parse the log file (a bad solution, will not do) and to understand every possible failure message in order to react accordingly (even worse).
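For what it's worth, the only built-in information on a 9.4 standby is a handful of functions, and none of them say whether the walreceiver is connected or why it fails (pg_stat_wal_receiver only appeared in 9.6), so a check like the one below, run on the standby, is not enough by itself:

# Run on the standby, e.g. as the postgres user
psql -At -c "SELECT pg_is_in_recovery(),
                    pg_last_xlog_receive_location(),
                    pg_last_xlog_replay_location(),
                    pg_last_xact_replay_timestamp();"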

Feel free to share with us if you have a suggestion.

ARogovsky commented on May 13, 2024

Hi!
I'm not a professional programmer, so I only have a "dirty solution":

  1. Check when the slave loses its sync/async replication status
  2. Wait some time as a grace period to let it catch up on data
  3. Check the slave's status from the master; if it is not there, mark the slave as failed (a rough sketch of this follows below)
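A rough shell sketch of that idea, run against the master; the application_name 'slave1' and the 300-second grace period are only assumptions for the example:

#!/bin/sh
# Wait up to GRACE seconds for the standby identified by APP_NAME to show
# up as "streaming" in pg_stat_replication on the master.
APP_NAME=slave1   # assumption: primary_conninfo sets application_name=slave1
GRACE=300         # assumption: 5 minutes of grace before declaring a failure
elapsed=0
while [ "$elapsed" -lt "$GRACE" ]; do
    state=$(psql -At -c "SELECT state FROM pg_stat_replication
                         WHERE application_name = '$APP_NAME'")
    if [ "$state" = "streaming" ]; then
        echo "standby $APP_NAME is streaming again"
        exit 0
    fi
    sleep 10
    elapsed=$((elapsed + 10))
done
echo "standby $APP_NAME still not replicating after ${GRACE}s, mark it failed"
exit 1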

ioguix commented on May 13, 2024

It seems that is what I described in my previous comment, and it's not good enough :(

1. You mean checking from the slave? Checking the replication status from a slave is currently not possible, as PostgreSQL doesn't provide any statistics about this on a standby. The only solution would be to check the log files, which is really a bad solution.
2. How much time should we give it? What if you just built a slave from a busy and large master and the slave has tens or hundreds of gigabytes to catch up? :(
3. This is already done during the "monitor" action. If a slave is not replicating, its master score is set to 0 so it cannot be a candidate for promotion should a failure occur on the master (see the example just below).
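For what it's worth, those scores are visible from the cluster side as transient node attributes; a quick way to check (the resource name pgsqld is only an example):

# Show the cluster status plus node attributes once; the promotion scores
# appear as attributes named "master-<resource>", e.g. master-pgsqld
crm_mon -A1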

ARogovsky commented on May 13, 2024
  1. I meant checking the slave status via the master, using pg_stat_replication. From the slave, you need to check pg_last_xlog_replay_location. There are also some examples for calculating lag per second and other features: https://vibhorkumar.wordpress.com/2014/05/21/monitoring-approach-for-streaming-replication-with-hot-standby-in-postgresql-9-3/
  2. This is very simple. I ran the catchup for a 300 GB database using an initial replica and checked the status on the master:
postgres@a:~$ psql -h 192.168.10.200 -P pager=off -q -A -c "SELECT * from pg_stat_replication"
pid|usesysid|usename|application_name|client_addr|client_hostname|client_port|backend_start|backend_xmin|state|sent_location|write_location|flush_location|replay_location|sync_priority|sync_state
332|10|postgres|pg_basebackup|192.168.10.3||49513|2016-04-18 13:55:43.781393+03||backup|0/0|37B/77085E80|37B/77085E80|37B/77085E80|0|async
335|10|postgres|pg_basebackup|192.168.10.3||49514|2016-04-18 13:55:44.37728+03||streaming|37B/7800E598|37B/7800E598|||0|async

So, if the slave is really catching up, it will show up in pg_stat_replication (see the lag query after this list).

  3. But I need non-replicating slaves to be marked as failed. I don't want to lose data in production.
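For completeness, a master-side sketch that turns pg_stat_replication into a byte lag per standby, using 9.4 function names (what threshold counts as "failed" is left open):

psql -At -c "SELECT application_name, client_addr, state,
                    pg_xlog_location_diff(pg_current_xlog_location(),
                                          replay_location) AS replay_lag_bytes
             FROM pg_stat_replication;"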

mrimbault commented on May 13, 2024

Hi!

I have several remarks about this issue.

First, you can indeed monitor the replication status from the master using the pg_stat_replication view, but only for the slaves that are actually connected. In your case, your old master that was disconnected can't reconnect again without a rebuild, hence the error messages it throws in its log file:

 2016-04-18 10:44:36 MSK FATAL: could not receive data from WAL stream: ERROR: requested starting point 37B/78000000 is ahead of the WAL flush position of this server 37B/770AEBB8

That means that from the master node, we will only see that the standby on the "old master" node is started, but not connected. That is not enough information to conclude it is failing: as ioguix mentioned earlier, it could be catching up, for example using a log-shipping mechanism (which will not be visible in the pg_stat_replication view).
You could also suffer from a temporary streaming disconnection on an otherwise sane standby, and I don't think marking the disconnected standby as FAILED (which would exclude it from the cluster) would be a good response.

At this point, PAF detects that the standby node (your "old master") is not connected, and gives it a promotion score of 0 to avoid promoting it in case of a failover.
It should also write some WARNING messages to the Pacemaker log files on the master node.

My second point is about your requirement of not having any data loss.
By design, PAF does not offer any guarantee about this.
The whole point of a resource agent is to automate the failover process, thus minimizing the RTO (Recovery Time Objective).
In your PostgreSQL configuration (not related to the PAF configuration), you are using asynchronous replication.
That means that transactions will be committed on the master before they are applied on the standbys. In case of a failover, the most up-to-date standby will be promoted by PAF, thus minimizing the data loss, but there can still be some loss.

If you really want to minimize the RPO (Recovery Point Objective), then you want something else, maybe in addition to PAF.
What I can suggest is that you configure your PostgreSQL instances to use synchronous replication.
That should guarantee minimal data loss in case of a failover (PAF will always choose the most up-to-date standby when failing over the resource), and in most cases you should suffer no data loss at all (except if both the master and the synchronized standby fail).
But that configuration comes at a performance cost, as every synchronous transaction on your master will have to wait until it is written on a standby before it can return a success status.
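As an illustration only (the standby names are assumptions, not something PAF requires), switching a 9.4 master to synchronous replication can be as small as:

# Assumes the standbys set application_name=slave1 / slave2 in primary_conninfo
psql -c "ALTER SYSTEM SET synchronous_standby_names = 'slave1,slave2';"
psql -c "SELECT pg_reload_conf();"

# Check which standby is now the synchronous one
psql -c "SELECT application_name, sync_state FROM pg_stat_replication;"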

In conclusion, remember that service high availability (what PAF provides) and data high availability (what PostgreSQL's synchronous replication provides) are two different concepts, and sometimes mutually exclusive.

mrimbault commented on May 13, 2024

Also, to clarify what I meant when I said that service high availability and data high availability can be mutually exclusive, please consider that, when using PostgreSQL's synchronous replication with one or more standbys, if at any point none of those standbys are available (maybe due to a network problem, or to a failure on their nodes), then all writing transactions on your master will wait until at least one standby is connected.
That means that you protect yourself against data loss, but you risk more service interruptions.
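If that situation ever occurs and you decide that service matters more than the zero-loss guarantee at that moment, the usual escape hatch is to empty the setting and reload (again, only an illustration of the trade-off):

psql -c "ALTER SYSTEM SET synchronous_standby_names = '';"
psql -c "SELECT pg_reload_conf();"
# Writes resume immediately, but you are back to asynchronous replication
# until a standby reconnects and the setting is restored.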

ARogovsky commented on May 13, 2024

Alright, I understand your point.

I have a different question.
For example, on one node the PostgreSQL slave has failed. How do I correctly exclude this failed slave from the cluster to run pg_basebackup, and then include it back?
I need the PostgreSQL service to be stopped, but I want to keep the other services (apache, etc.) online.

ARogovsky commented on May 13, 2024

Alright, I just removed the node from the cluster entirely, synced it, and brought it back into the cluster.
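An alternative that keeps the other resources on the node running is to ban only the PostgreSQL resource from that node, rebuild the instance, then clear the ban; a rough sketch, assuming the PAF master/slave resource is named pgsql-ha and the node srv2 (both just example names):

# Keep apache & co. running, but stop PostgreSQL on srv2 only
pcs resource ban pgsql-ha srv2

# Rebuild the instance from the current master (paths from this thread,
# connection parameters are placeholders)
rm -rf /var/lib/postgresql/9.4/main
pg_basebackup -h <master_ip> -U <replication_user> \
    -D /var/lib/postgresql/9.4/main -X stream -P
# restore recovery.conf / pg_hba.conf as needed

# Remove the ban so Pacemaker starts the slave again on srv2
pcs resource clear pgsql-ha srv2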
