paf's Issues

Clarification about resource ordering

Hello

The howto states:
WARNING: in step 4, the start/stop and promote/demote order for these resources must be asymmetrical: we must keep the master IP on the master during its demote process.
I don't quite understand this statement, in particular regarding the demote action. At the beginning, the howto suggests rejecting self-replication:
host replication postgres VIP/32 reject
As far as I understand, the slave will be unable to connect to the master whether the master IP is there or not.

Could you give me some explanation about that ordering requirement?

thanks
dud

PS : many many thanks for this awesome work ;)

Debian packaging of 2.1

todo:

  • build the Debian packages we provide on GitHub
  • ping @ge-fa for integration in stretch if released before January 5th, 2017 (see #13)

Release 2.0 rc1

As soon as the documentation is up to date, release an RC of v2.0.

Modify quickstart doc

Some modifications we discussed:

  • rename the VIP from pgsql-ha to pgsql-vip
  • move the recovery_template to /etc
  • move pg_hba.conf to /etc
  • add a warning about pg_hba.conf and recovery_template in PGDATA
  • update quickstart to use PostgreSQL 9.6

Avoid crm_attribute and crm_master during transitions

Because these commands set transient attributes, they frequently break transitions and force the PEngine to recompute a new transition that takes over from the previous one.

This is annoying, as it changes the notify environment variables in the middle of a transition and requires more cluster communication.

crm_master should be called only during the regular monitor action call.

crm_attribute can be replaced with attrd_updater.
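
As an illustration only (the attribute name is a placeholder, and --private needs a recent Pacemaker, as discussed in the EL6 issue further down this page), a private attribute written through attrd never lands in the CIB, so it cannot trigger a new transition by itself:

# store a working value through attrd as a private attribute instead of
# calling crm_attribute; private attributes are not written to the CIB
sub _set_private_attr {
    my ( $name, $value ) = @_;
    system( 'attrd_updater', '--private', '-n', $name, '-U', $value );
    return $? >> 8;
}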

Master role with master_score != 1001

In some setups, after a switchover/failover, a race condition can lead to a situation where the new master finds itself in pg_stat_replication.

Because of this, it decreases its own score to 1000 or 1.

Avoid such a situation in the _check_locations sub.

Document softdog/watchdog

This kind of STONITH by node suicide might be useful for test sessions or for stronger setups.

It deserves some documentation, and maybe even integration into the Quick Start pages.

PAF Auto Failback Support

Hi, my setup has only two nodes. My purpose is to build an active/passive PostgreSQL cluster. If the master is unreachable, the slave must serve until the master comes back.

I have two servers (CentOS 6.6) to test this situation. I installed PAF as you documented, and I only changed:

pcs -f cluster1.xml resource master pgsql-ha pgsqld
master-max=1 master-node-max=1
clone-max=3 clone-node-max=1 notify=true

to

pcs -f cluster1.xml resource master pgsql-ha pgsqld
master-max=1 master-node-max=1
clone-max=2 clone-node-max=1 notify=true

For testing purposes, I disconnected the LAN connection of the master and monitored the situation with "pcs status": the shared IP was assigned to the slave and the slave became master.

After re-enabling the LAN connection of the old master, the old master became master again, but the PostgreSQL log on the slave server says:

< 2016-09-04 17:39:06.245 EEST >LOG: entering standby mode
< 2016-09-04 17:39:06.255 EEST >LOG: consistent recovery state reached at 0/3015B28
< 2016-09-04 17:39:06.255 EEST >LOG: invalid record length at 0/3015B28
< 2016-09-04 17:39:06.255 EEST >LOG: database system is ready to accept read only connections
< 2016-09-04 17:39:06.257 EEST >FATAL: pg_hba.conf rejects replication connection for host "10.0.0.36", user "postgres", SSL off
< 2016-09-04 17:39:06.258 EEST >FATAL: could not connect to the primary server: FATAL: pg_hba.conf rejects replication connection for host "10.0.0.36", user "postgres", SSL off

< 2016-09-04 17:39:21.274 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:39:26.281 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:39:31.279 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:39:36.287 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:39:41.290 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:39:46.297 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:39:51.300 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:39:56.306 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:40:01.312 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:40:06.316 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:40:11.321 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:40:16.324 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:40:21.330 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:40:26.336 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:40:31.342 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:40:36.347 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:40:41.353 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:40:46.357 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3
< 2016-09-04 17:40:51.363 EEST >FATAL: highest timeline 2 of the primary is behind recovery timeline 3

Is this normal? Is it impossible to sync the old master with the new master before making the old master the master again?

If it is impossible, can we prevent the old master from becoming master again after it comes back? What do you suggest in this situation? If we prevent the old master from becoming master again, can we sync it with the new master using the pg_rewind tool?

Also, I did not configure corosync.conf and you said that "pcs cluster setup" does it for me, but I do not have an "/etc/corosync/corosync.conf" file. Is this normal? Where can I write "two_node: 1"? I searched and I have a "two_node" setting in "/etc/cluster/cluster.conf" as:

...<!cman broadcast="no" expected_votes="1" transport="udp" two_node="1"/>

Is it enough?

Also, my servers run on VMware and in my setup I have no access to the hypervisor, so I cannot use "fence_virsh" etc. Because of this, I disabled STONITH with:

pcs property set stonith-enabled=false

after pushing the CIB XML.

But even so, I get these repetitive errors in "/var/log/messages":

Sep 4 17:41:04 node_01 fence_pcmk[9679]: Call to fence node_02 (reset) failed with rc=237
Sep 4 17:41:04 node_01 stonith-ng[5395]: notice: Operation reboot of node_01 by for stonith_admin.cman.8195@node_02.5aa18e64: No such device
Sep 4 17:41:04 node_01 crmd[5399]: notice: Peer node_01 was not terminated (reboot) by for node_02: No such device (ref=5aa18e64-e2c3-4be8-a2b0-c367ea7ef977) by client stonith_admin.cman.8195
Sep 4 17:41:07 node_01 fence_pcmk[9715]: Requesting Pacemaker fence node_02 (reset)
Sep 4 17:41:07 node_01 stonith-ng[5395]: notice: Client stonith_admin.cman.9716.089f30fb wants to fence (reboot) 'node_02' with device '(any)'
Sep 4 17:41:07 node_01 stonith-ng[5395]: notice: Initiating remote operation reboot for node_02: 7e817e63-0497-42e9-b09e-0e4a22ae8ef8 (0)
Sep 4 17:41:07 node_01 stonith-ng[5395]: notice: Couldn't find anyone to fence (reboot) node_02 with any device
Sep 4 17:41:07 node_01 stonith-ng[5395]: error: Operation reboot of node_02 by for stonith_admin.cman.9716@node_01.7e817e63: No such device
Sep 4 17:41:07 node_01 crmd[5399]: notice: Peer node_02 was not terminated (reboot) by for node_01: No such device (ref=7e817e63-0497-42e9-b09e-0e4a22ae8e89) by client stonith_admin.cman.9716

Also, my "/etc/cluster/cluster.conf" file looks like this:

<!cluster config_version="9" name="cluster_pgsql">
<!fence_daemon/>
<!clusternodes>
<!clusternode name="node_01" nodeid="1">
<!fence>
<!method name="pcmk-method">
<!device name="pcmk-redirect" port="node_01"/>



<!clusternode name="node_02" nodeid="2">
<!fence>
<!method name="pcmk-method">
<!device name="pcmk-redirect" port="node_02"/>




<!cman broadcast="no" expected_votes="1" transport="udp" two_node="1"/>
<!fencedevices>
<!fencedevice agent="fence_pcmk" name="pcmk-redirect"/>

<!rm>
<!failoverdomains/>
<!resources/>

  • I added "!" after the "<" sign, otherwise GitHub's text editor hides the tags.

Should I remove the lines which refer to fencing, like "..." and "..."?

Thanks for your help.

Sync replication and replication slot problem

Hello,

For one of our customers we have to use synchronous replication with synchronous_standby_names = '*' in postgresql.conf, combined with replication slots in the database.

There is a problem when one standby leaves the cluster: its replication slot stays in the database and PostgreSQL keeps its WAL (because of the slot and the synchronous_standby_names setting) until the standby comes back. If the standby is absent for a long time, the WAL directory grows until the partition is full...

So, do you think it would be possible to manage replication slots with the PAF RA?
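
For example, if the RA were to clean up slots itself, a rough sketch (purely illustrative, not PAF code, and connection options are omitted) could drop the inactive physical slots left behind by a departed standby:

# drop every inactive replication slot so WAL retention stops
my @slots = split /\n/, qx{ psql -AtXc "SELECT slot_name FROM pg_replication_slots WHERE NOT active" };
foreach my $slot ( @slots ) {
    system( 'psql', '-AtXc', qq{SELECT pg_drop_replication_slot('$slot')} );
}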

Regards,
Pascal.

Next best secondaries master scores

In sub _check_locations, we order the secondaries connected to the primary based on their LSN diff (rownum), update the master score of the most up-to-date one to "1000", and update the master score of the other nodes to "1".
At this point we have chosen the best candidate, but what if we suffer a failure that removes it along with the current primary?
To deal with this case better, we could easily change this part so that each node besides the best one receives a score based on its rownum.
The final score of each node should also be set high enough to prevent the location constraints and the initial secondary score (1) from kicking in.
"1000" (or whatever value we choose) should stay the absolute value for the best choice, so it can be used again during pgsql_promote.

Stats_temp_directory not existing after server restart under Debian 8

On a Debian 8 box, a PostgreSQL instance created using the Debian packaging tool pg_createcluster (like the instance automatically created at package installation) is configured with the following parameter [1]:

stats_temp_directory = '/var/run/postgresql/9.5-main.pg_stat_tmp'

The problem is that /var/run is a symlink to /run, which is a tmpfs.
So if the box reboots, the directory does not exist anymore.
An instance managed by Debian utilities like pg_ctlcluster recreates it at startup, but PAF uses PostgreSQL's pg_ctl command, not Debian-specific tools, so the PostgreSQL instance cannot be started again until this directory has been created manually.

I see no easy workaround here, so maybe we should add to PAF's prerequisites that the directory configured for stats_temp_directory must not be located under a tmpfs (though it can itself be the mount point of a tmpfs).
If so, we also need to fix the Debian 8 Quick Start to adapt this parameter.

[1]: Cf. the related Debian bug report that led to the inclusion of this feature in Debian-packaged PostgreSQL.

Master node doesn't start after failure

I have created a 3-node PostgreSQL cluster (srv1 (master), srv2 (slave), srv3 (slave)) for testing.
I disabled the master node's (srv1) network, and the fence device then powered off the master node. The second node (srv2) became master, so it worked properly up to this point.
After I powered the old master node srv1 back on and tried to start it, it got an error and powered off automatically. (I ran the "pcs cluster start srv1" command.)

What can I do to make node srv1 the master again?
I am using PostgreSQL 9.4 and I followed your setup (http://dalibo.github.io/PAF/Quick_Start-CentOS-7.html).

Best Regards.

PAF doesn't check slave status on the master

Hi. I have been testing PAF in depth and found a problem.
One of my slaves lost replication due to an unexpected shutdown. After starting, it writes the following in its log:

2016-04-14 10:42:21 MSK LOG:  started streaming WAL from primary at 37B/75000000 on timeline 12
2016-04-14 10:42:21 MSK FATAL:  could not receive data from WAL stream: ERROR:  requested starting point 37B/75000000 is ahead of the WAL flush position of this server 37B/7408C5F8

2016-04-14 10:42:21 MSK LOG:  received fast shutdown request
2016-04-14 10:42:21 MSK LOG:  aborting any active transactions
2016-04-14 10:42:21 MSK FATAL:  terminating connection due to administrator command
2016-04-14 10:42:21 MSK LOG:  shutting down
2016-04-14 10:42:21 MSK LOG:  database system is shut down

pg_lsclusters shows the database as online, but in fact the database is down.

A check query on the master, like SELECT client_addr, sync_state FROM pg_stat_replication, confirms that replication is lost. I believe this status could be obtained via the slave monitor status code or some other means.
Your RA should check this status and mark the failed slave databases as down.

Avoid setting lsn_location during switchover

Currently, only the designated master-to-be avoids setting the lsn_location private attribute when it detects the switchover from the old master to itself. The other slaves (and the old master) do not detect the switchover and set their current lsn_location.

This is just a small improvement preventing some useless operations, but it is worth fixing in my opinion, to get cleaner, safer and more consistent behavior from the RA.

incorrect time units in pg_ctl timeout value

Hi,
there is an inconsistency in how timeout values are interpreted. The cluster settings require the value in seconds, but the agent itself receives the timeout value in milliseconds.

With the following setup

root@cluster1:~# pcs resource show postgres-ha
 Master: postgres-ha
  Meta Attrs: master-node-max=1 master-max=1 clone-max=2 notify=True clone-node-max=1 target-role=Stopped
  Resource: postgres (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: pgdata=/var/lib/pgsql/9.5/data pgport=5432 bindir=/usr/pgsql-9.5/bin
   Operations: start interval=0s timeout=60s (postgres-start-interval-0s)
               stop interval=0s timeout=60s (postgres-stop-interval-0s)
               promote interval=0s timeout=30s (postgres-promote-interval-0s)
               demote interval=0s timeout=120s (postgres-demote-interval-0s)
               notify interval=0s timeout=60s (postgres-notify-interval-0s)
               monitor interval=15s role=Master timeout=10s (postgres-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (postgres-monitor-interval-16s)

E.g. for the start command, the OCF_RESKEY_CRM_meta_timeout variable is 60000.

And the resulting start command looks like:
DEBUG: _runas: launching as "postgres" command "/usr/pgsql-9.5/bin/pg_ctl --pgdata /var/lib/pgsql/9.5/data -w --timeout 60060 start"

Jan

possible race condition when setting attributes during actions

Following a discussion on the Pacemaker mailing list, setting attributes is an asynchronous action. This might lead to situations where:

  1. one action decides to set an attribute; the call is processed asynchronously by attrd
  2. the next action is executed and does not find this attribute, leading to an error
  3. attrd sets the attribute and makes it available to all the nodes.

Because of this asynchronous behavior, we should wrap the calls to crm_attribute or attrd_updater in a sub that simulates a synchronous call, waiting for the attribute and its new value to be available before returning to the normal code flow.
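
A minimal sketch of such a wrapper (illustrative only; names are placeholders and the -Q output parsing is an assumption about attrd_updater's text format):

# set a node attribute, then poll attrd until the new value is visible,
# to emulate a synchronous update
sub _set_attribute_sync {
    my ( $name, $value, $timeout ) = @_;
    system( 'attrd_updater', '-n', $name, '-U', $value );
    for ( 1 .. $timeout * 10 ) {
        # attrd_updater -Q output looks like: name="..." host="..." value="..."
        my $out = qx{ attrd_updater -Q -n "$name" 2>/dev/null };
        return 1 if defined $out and $out =~ /value="\Q$value\E"/;
        select undef, undef, undef, 0.1;    # sleep 100 ms
    }
    return 0;    # value still not visible, let the caller decide what to do
}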

ocf:heartbeat:pgsqlms resource

Hello, I would like to know: when we use the pgsqlms resource, is the replication synchronous or asynchronous between the database nodes?

I used this config:

pgsqld

pcs -f cluster1.xml resource create pgsqld ocf:heartbeat:pgsqlms
bindir=/usr/pgsql-9.3/bin pgdata=/var/lib/pgsql/9.3/data
op start timeout=60s
op stop timeout=60s
op promote timeout=30s
op demote timeout=120s
op monitor interval=15s timeout=10s role="Master"
op monitor interval=16s timeout=10s role="Slave"
op notify timeout=60s \

pgsql-ha

pcs -f cluster1.xml resource master pgsql-ha pgsqld
master-max=1 master-node-max=1
clone-max=3 clone-node-max=1 notify=true

Deal with OCF_RESKEY_CRM_meta_notify_active_* fix in OCF_Functions.pm

The Pacemaker project fixed the bug where OCF_RESKEY_CRM_meta_notify_active_* was always empty: http://bugs.clusterlabs.org/show_bug.cgi?id=5295

Today, we bypass this bug directly in pgsqlms with specific code. We should instead deal with it in OCF_Functions.pm, using different code paths for Pacemaker < 1.1.16 and >= 1.1.16, and keep the RA code clean.

See:

Initial master score for started secondary

In sub pgsql_start, after successfully starting the instance as a secondary, we set its master score to "1".
This is mandatory for the first startup of the cluster to succeed, as at this point there is no master yet, so Pacemaker has to choose one of the secondaries to promote.
But as soon as we have promoted one instance, the master monitor will automatically update the master score for the secondaries connected through streaming replication.
As a freshly started secondary will not yet have connected, and will most likely lag behind an existing primary, it would probably be safer to avoid its promotion if any better choice exists.

We could set the master score of a starting secondary to 1 only if OCF_PRIMARY_NODE is not set, and set it to 0 otherwise.
It may not be necessary though: even if this instance is selected for promotion before it could connect to the primary, the promotion action itself will take care of detecting whether a better promotion candidate is available.

Check the recovery.conf file existence after promotion

In sub pgsql_promote, right after the promotion has succeeded, we check whether a recovery.conf file exists, and if it does we fail the promotion.
The existence of the recovery.conf file does not mean that the promotion failed, but it does suggest that something went wrong. Should we really fail the promotion based on this file's existence? Maybe a warning message would be enough.

Also, why check this only right after promotion?
Should this test be moved to the monitor action instead, after we have confirmed the instance is a primary?

manual failover when all servers go down and only a standby comes back.

Hi,

I am testing a scenario with 2 servers where both the standby and the master go down (in that order).
Then only the slave goes back online.

The standby is never promoted as a primary (I can understand why).

The only way I found to do it was to issue "pg_ctl promote -D $PGDATA" myself (i.e. a manual failover).
Is there a way to do it with the crm commands? (Just moving the resource didn't do it.)

Benoit.

Conf: Ubuntu 14.04 // PG 9.4 // Corosync 2.3.3 // Pacemaker 1.1.10 // PAF 1.1

PAF not compatible with EL6 distros

As discussed in issue #9, recent development activity broke support for Pacemaker 1.1.12, currently packaged and used on EL6 distros.

The incompatibility comes from the --private argument of the attrd_updater command tool.

We should try to refactor the code using attrd_updater --private to support EL6 again in the next PAF release.

Move pgsqlms to a different namespace?

We are currently installing pgsqlms to the heartbeat namespace.

This choice is questionable for several reasons:

  • it seems the heartbeat namespace is expected to host official RAs from the resource-agents project
  • heartbeat is a pretty misleading namespace, as heartbeat has long been deprecated in favor of Pacemaker
  • there are ongoing discussions about this heartbeat namespace (deprecating it? moving official RAs to a Pacemaker namespace?). See: https://www.mail-archive.com/[email protected]/msg00241.html

We should probably create the PAF namespace and install pgsqlms in there by default.

Error level when controldata is not consistent with pg_isready

In sub pgsql_monitor, we raise soft errors (OCF_ERR_GENERIC) when we confirm an inconsistency between controldata and pg_isready.
This result probably means that something really wrong is happening.
Should we raise a hard error here instead, so the resource goes down immediately?

After promote, does PAF forget to delete recovery.conf?

Hi
I shut down the master node and PAF tried to promote the instance on a new node:
May 24 11:12:55 b pgsqlms(pgsqld)[9088]: INFO: pgsql_notify: promoting instance on node "b"
...
May 24 11:13:00 b pgsqlms(pgsqld)[9545]: INFO: pgsql_notify: promoting instance on node "b"

The recovery.conf file is still present and was not deleted. I think recovery.conf needs to be removed after the promotion?

Wrong conversion from hex to dec of LSN

Since 9.3, PostgreSQL uses 256 segments per logical WAL file, not 255. That means a logical WAL file covers 4GB, not 4GB-16MB.

As a consequence, the following conversion is wrong:

$max_lsn_dec = (hex('ff000000') * hex($wal_num)) + hex($wal_off);

it should be: $max_lsn_dec = ( 4 * 1024 * 1024 * 1024 * hex($wal_num) ) + hex($wal_off);
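
As a quick check (illustrative only), converting an LSN such as "37B/78000090" with the corrected 4GB multiplier:

my ( $wal_num, $wal_off ) = split m{/}, '37B/78000090';
# 4GB per logical WAL file since 9.3 (256 segments of 16MB)
my $lsn_dec = ( hex('100000000') * hex($wal_num) ) + hex($wal_off);
# the old hex('ff000000') multiplier would under-count by 16MB per logical WAL file
print "$lsn_dec\n";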

This bug should not have a large impact in production, but it would be better to fix it.

PAF tests failed

Hi!
I ran the tests and they failed. Can you help me?

root@a:~/PAF-v1.0.0/t# export PGDATA=/tmp/pgdata1 PGBIN=/usr/lib/postgresql/9.4/bin RESOURCE_NAME=pgsqld 
root@a:~/PAF-v1.0.0/t#  ocft test -v pgsqlms
Initializing 'pgsqlms' ...
PGBIN: /usr/lib/postgresql/9.4/bin
Done.

Starting 'pgsqlms' case 0 'check validate-all':
Setting agent environment:    export OCF_RESKEY_bindir=/usr/lib/postgresql/9.4/bin
Setting agent environment:    export OCF_RESKEY_pgdata=/tmp/pgdata1
Setting agent environment:    export OCF_RESOURCE_INSTANCE=pgsqld
Setting agent environment:    export OCF_RESKEY_primary_node=
Running agent:                ./pgsqlms validate-all
Checking return value:        OK. The return value 'OCF_SUCCESS' == 'OCF_SUCCESS'

Starting 'pgsqlms' case 1 'check stopped monitor':
Setting agent environment:    export OCF_RESKEY_bindir=/usr/lib/postgresql/9.4/bin
Setting agent environment:    export OCF_RESKEY_pgdata=/tmp/pgdata1
Setting agent environment:    export OCF_RESOURCE_INSTANCE=pgsqld
Setting agent environment:    export OCF_RESKEY_primary_node=
Running agent:                ./pgsqlms monitor
Checking return value:        OK. The return value 'OCF_NOT_RUNNING' == 'OCF_NOT_RUNNING'

Starting 'pgsqlms' case 2 'check start':
Setting agent environment:    export OCF_RESKEY_bindir=/usr/lib/postgresql/9.4/bin
Setting agent environment:    export OCF_RESKEY_pgdata=/tmp/pgdata1
Setting agent environment:    export OCF_RESOURCE_INSTANCE=pgsqld
Setting agent environment:    export OCF_RESKEY_primary_node=
Running agent:                ./pgsqlms start
Checking return value:        FAILED. The return value 'OCF_ERR_GENERIC' != 'OCF_SUCCESS'. See details below:
pgsqlms(default)[2900]: 2016/03/25_14:04:33  DEBUG: _runas: launching as "postgres" command "/usr/lib/postgresql/9.4/bin/pg_isready -h /tmp -p 5432"
/tmp:5432 - no response
pgsqlms(default)[2900]: 2016/03/25_14:04:33  DEBUG: pgsql_monitor: instance "default" is not listening
pgsqlms(default)[2900]: 2016/03/25_14:04:33  DEBUG: _runas: launching as "postgres" command "/usr/lib/postgresql/9.4/bin/pg_ctl -D /tmp/pgdata1 status"
pg_ctl: no server running
pgsqlms(default)[2900]: 2016/03/25_14:04:33  DEBUG: _confirm_stopped: no postmaster process found for instance "default"
pgsqlms(default)[2900]: 2016/03/25_14:04:33  DEBUG: _controldata: instance "default" state is "shut down"
pgsqlms(default)[2900]: 2016/03/25_14:04:33  DEBUG: _confirm_stopped: instance "default" controldata indicates that the instance was propertly shut down
pgsqlms(default)[2900]: 2016/03/25_14:04:33  DEBUG: pgsql_start: instance "default" is not running, starting it as a secondary
pgsqlms(default)[2900]: 2016/03/25_14:04:33  DEBUG: _create_recovery_conf: get replication configuration from the template file "/tmp/pgdata1/recovery.conf.pcmk"
pgsqlms(default)[2900]: 2016/03/25_14:04:33  DEBUG: _create_recovery_conf: write the replication configuration to "/tmp/pgdata1/recovery.conf" file
pgsqlms(default)[2900]: 2016/03/25_14:04:33  DEBUG: _runas: launching as "postgres" command "/usr/lib/postgresql/9.4/bin/pg_ctl -D /tmp/pgdata1 -w start"
waiting for server to start....LOG:  database system was shut down at 2016-03-25 14:04:27 MSK
LOG:  entering standby mode
LOG:  consistent recovery state reached at 0/16A8378
LOG:  record with zero length at 0/16A8378
LOG:  database system is ready to accept read only connections
FATAL:  could not connect to the primary server: could not connect to server: Connection refused
Is the server running on host "127.0.0.1" and accepting
TCP/IP connections on port 15432?

done
server started
pgsqlms(default)[2900]: 2016/03/25_14:04:34  DEBUG: _runas: launching as "postgres" command "/usr/lib/postgresql/9.4/bin/pg_isready -h /tmp -p 5432"
/tmp:5432 - no response
pgsqlms(default)[2900]: 2016/03/25_14:04:34  DEBUG: pgsql_monitor: instance "default" is not listening
pgsqlms(default)[2900]: 2016/03/25_14:04:34  DEBUG: _runas: launching as "postgres" command "/usr/lib/postgresql/9.4/bin/pg_ctl -D /tmp/pgdata1 status"
pg_ctl: server is running (PID: 2930)
/usr/lib/postgresql/9.4/bin/postgres "-D" "/tmp/pgdata1"
2016/03/25_14:04:34  ERROR: _confirm_stopped: instance "default" is not listening, but the process referenced in postmaster.pid exists
2016/03/25_14:04:34  ERROR: pgsql_start: instance "default" is not running as a slave (returned 1)

Starting 'pgsqlms' case 3 'check double start':
Setting agent environment:    export OCF_RESKEY_bindir=/usr/lib/postgresql/9.4/bin
Setting agent environment:    export OCF_RESKEY_pgdata=/tmp/pgdata1
Setting agent environment:    export OCF_RESOURCE_INSTANCE=pgsqld
Setting agent environment:    export OCF_RESKEY_primary_node=
ERROR: './pgsqlms monitor' failed, the return code is 1.
Starting 'pgsqlms' case 4 'check stop':
Setting agent environment:    export OCF_RESKEY_bindir=/usr/lib/postgresql/9.4/bin
Setting agent environment:    export OCF_RESKEY_pgdata=/tmp/pgdata1
Setting agent environment:    export OCF_RESOURCE_INSTANCE=pgsqld
Setting agent environment:    export OCF_RESKEY_primary_node=
ERROR: './pgsqlms monitor' failed, the return code is 1.
Starting 'pgsqlms' case 5 'check double stop':
Setting agent environment:    export OCF_RESKEY_bindir=/usr/lib/postgresql/9.4/bin
Setting agent environment:    export OCF_RESKEY_pgdata=/tmp/pgdata1
Setting agent environment:    export OCF_RESOURCE_INSTANCE=pgsqld
Setting agent environment:    export OCF_RESKEY_primary_node=
ERROR: './pgsqlms monitor' failed, the return code is 1.
Starting 'pgsqlms' case 6 'check slave monitor':
Setting agent environment:    export OCF_RESKEY_bindir=/usr/lib/postgresql/9.4/bin
Setting agent environment:    export OCF_RESKEY_pgdata=/tmp/pgdata1
Setting agent environment:    export OCF_RESOURCE_INSTANCE=pgsqld
Setting agent environment:    export OCF_RESKEY_primary_node=
ERROR: './pgsqlms monitor' failed, the return code is 1.
Starting 'pgsqlms' case 7 'check promote':
Setting agent environment:    export OCF_RESKEY_bindir=/usr/lib/postgresql/9.4/bin
Setting agent environment:    export OCF_RESKEY_pgdata=/tmp/pgdata1
Setting agent environment:    export OCF_RESOURCE_INSTANCE=pgsqld
Setting agent environment:    export OCF_RESKEY_primary_node=
ERROR: './pgsqlms monitor' failed, the return code is 1.
Starting 'pgsqlms' case 8 'check double promote':
Setting agent environment:    export OCF_RESKEY_bindir=/usr/lib/postgresql/9.4/bin
Setting agent environment:    export OCF_RESKEY_pgdata=/tmp/pgdata1
Setting agent environment:    export OCF_RESOURCE_INSTANCE=pgsqld
Setting agent environment:    export OCF_RESKEY_primary_node=
ERROR: './pgsqlms monitor' failed, the return code is 1.
Starting 'pgsqlms' case 9 'check master monitor':
Setting agent environment:    export OCF_RESKEY_bindir=/usr/lib/postgresql/9.4/bin
Setting agent environment:    export OCF_RESKEY_pgdata=/tmp/pgdata1
Setting agent environment:    export OCF_RESOURCE_INSTANCE=pgsqld
Setting agent environment:    export OCF_RESKEY_primary_node=
ERROR: './pgsqlms monitor' failed, the return code is 1.
Starting 'pgsqlms' case 10 'check demote':
Setting agent environment:    export OCF_RESKEY_bindir=/usr/lib/postgresql/9.4/bin
Setting agent environment:    export OCF_RESKEY_pgdata=/tmp/pgdata1
Setting agent environment:    export OCF_RESOURCE_INSTANCE=pgsqld
Setting agent environment:    export OCF_RESKEY_primary_node=
ERROR: './pgsqlms monitor' failed, the return code is 1.
Starting 'pgsqlms' case 11 'check double demote':
Setting agent environment:    export OCF_RESKEY_bindir=/usr/lib/postgresql/9.4/bin
Setting agent environment:    export OCF_RESKEY_pgdata=/tmp/pgdata1
Setting agent environment:    export OCF_RESOURCE_INSTANCE=pgsqld
Setting agent environment:    export OCF_RESKEY_primary_node=
ERROR: './pgsqlms monitor' failed, the return code is 1.
Cleaning 'pgsqlms' ...
Done.

Question about resource status after different failed restart scenarios

Hi,

I tested the following scenario. I am new to Pacemaker so I am not sure whether the result is intended (or whether the test is worth anything), and I am also not 100% sure whether it's tied to the agent or to an on-fail clause of the configuration. Sorry for the noise if this is normal or unrelated to the agent.

I use:

  • Ubuntu 14.04
  • Pacemaker 1.1.10 / Corosync 2.3.3
  • Paf 1.0.2
  • PostgreSQL 9.4

Cluster in nominal state:

 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2):       Started x64lmwbio9f-priv
     vip-rep    (ocf::heartbeat:IPaddr2):       Started x64lmwbio9f-priv
 Master/Slave Set: msPostgresql [pgsqlsrv]
     Masters: [ x64lmwbio9f-priv ]
     Slaves: [ x64lmwbio9g-priv ]
 Clone Set: fencing [st-null]
     Started: [ x64lmwbio9f-priv x64lmwbio9g-priv ]

Stop the node x64lmwbio9g-priv with "sudo service pacemaker stop"
Modify the postgresql.conf to introduce an error during startup.
Start the node x64lmwbio9g-priv with "sudo service pacemaker start"

The cluster ends up like this (I wasn't surprised):

 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2):       Started x64lmwbio9f-priv
     vip-rep    (ocf::heartbeat:IPaddr2):       Started x64lmwbio9f-priv
 Master/Slave Set: msPostgresql [pgsqlsrv]
     Masters: [ x64lmwbio9f-priv ]
     Stopped: [ x64lmwbio9g-priv ]
 Clone Set: fencing [st-null]
     Started: [ x64lmwbio9f-priv x64lmwbio9g-priv ]

Now if I do the following:

Stop the node x64lmwbio9g-priv with "sudo service pacemaker stop"
Modify the application_name in recovery.conf.pcmk so that pgsql_validate_all fails (it happened to me once when I forgot to restore the configuration after a hasty pg_rewind).
Start the node x64lmwbio9g-priv with "sudo service pacemaker start"

Full list of resources:

 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2):       Stopped
     vip-rep    (ocf::heartbeat:IPaddr2):       Stopped
 Master/Slave Set: msPostgresql [pgsqlsrv]
     Stopped: [ x64lmwbio9f-priv x64lmwbio9g-priv ]
 Clone Set: fencing [st-null]
     Started: [ x64lmwbio9f-priv x64lmwbio9g-priv ]

I understand that application_name is a prerequisite for the agent to work, but I am surprised that a wrong configuration on one node compromises the whole cluster.

I can provide more info if needed.

Benoit.

Side note: with 1.0.0 the vip-* group didn't stop in the last case; with 1.0.2 it works as intended, which is cool!

recovery.conf in PGDATA, not in the data directory

Hi,

I think we should create the new recovery.conf in the data directory, not in PGDATA.

https://www.postgresql.org/docs/9.4/static/continuous-archiving.html
Create a recovery command file recovery.conf in the cluster data directory (see Chapter 26).

postgresql.conf

# The default values of these variables are driven from the -D command-line
# option or PGDATA environment variable, represented here as ConfigDir.

#data_directory = 'ConfigDir'           # use data in another directory
                                        # (change requires restart)

Benoit

Add parameter to track highest reached timeline

Hi,

I have a proposal for an additional data consistency measure that will help prevent wrong promote decisions.
The proposal is to create a permanent parameter that stores the highest timeline number ever reached in this database cluster. The parameter is saved in the post-promote phase and consulted in pre-promote. It will ensure that a failed master is never promoted.

Details:
post-promote: save the new timeline value to the crm_config database. Why crm_config and not a private attribute: a crm_config parameter is permanent across reboots/crashes, node independent, and consistently reachable from any node within the quorum partition. Format: crm_attribute --lifetime forever --type crm_config --name "$name" --update "$val"
pre-promote: get the timeline value of the local database and compare it to the highest global timeline value. If the local timeline is lower than the highest global one, abort the promotion (set the attribute to abort).
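
A rough sketch of those two steps (my own illustration, not final code; the "highest-timeline" attribute name and the helpers are placeholders):

# read the current timeline of the local instance from pg_controldata
sub _local_timeline {
    my ( $pgdata ) = @_;
    my ( $tl ) = qx{ pg_controldata "$pgdata" } =~ /Latest checkpoint's TimeLineID:\s+(\d+)/;
    return $tl;
}

# post-promote: record the new timeline cluster-wide and permanently
sub _save_highest_timeline {
    my ( $tl ) = @_;
    system( 'crm_attribute', '--lifetime', 'forever', '--type', 'crm_config',
            '--name', 'highest-timeline', '--update', $tl );
}

# pre-promote: refuse to promote an instance whose timeline is behind
sub _promotion_allowed {
    my ( $pgdata ) = @_;
    chomp( my $highest = qx{ crm_attribute --type crm_config --name highest-timeline --query --quiet 2>/dev/null } );
    return 1 unless $highest =~ /^\d+$/;    # nothing recorded yet
    return _local_timeline( $pgdata ) >= $highest;
}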

Why it is needed:

  • it ensures that a failed master (or a greatly lagging slave) will never be promoted under any circumstances (even when fencing is not configured)
  • it can help protect data when the master vote parameters are inconsistent (e.g. issue #56 or any future inconsistencies)
  • it is just an additional measure that can be helpful, and it does not interfere with the current voting mechanisms
  • it is a prerequisite for the auto-rewind of failed masters that I'm considering implementing (I've already thought it through and I'll open a separate issue to discuss it, but in short: if the local timeline of the DB is lower than the global highest timeline during the start of a local resource, we have identified a failed master (or a greatly lagging slave) that needs a rewind or a basebackup)

I'm halfway through implementing the global timeline check, and I've opened this issue to ask whether this sounds desirable to you (my aim is to integrate as many changes as possible back into your project).

Jan

pgsql_stop fails on master if postgres fails

If the postgres process on the master instance terminates unexpectedly, stopping the pgsqlms resource fails. Since the resource cannot be stopped, a slave node cannot be promoted. This may be somewhat dependent on how the stop op is configured.

In the code, pgsql_stop calls into pgsql_monitor [1], which returns $OCF_FAILED_MASTER. That value is not one which is explicitly handled, so we drop into the catch-all case, which is interpreted as a failure.

I think a failed master should be interpreted as a successful stop in this case.

[1] https://github.com/dalibo/PAF/blob/v1.0.0/script/pgsqlms#L861
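
A sketch of the suggested handling (standard OCF return codes; pgsql_monitor stands for the agent's existing monitor routine, and this is not the actual PAF code):

use constant {
    OCF_SUCCESS        => 0,
    OCF_ERR_GENERIC    => 1,
    OCF_NOT_RUNNING    => 7,
    OCF_RUNNING_MASTER => 8,
    OCF_FAILED_MASTER  => 9,
};

sub _stop_rc_for_monitor_status {
    my ( $monitor_rc ) = @_;
    return OCF_SUCCESS if $monitor_rc == OCF_NOT_RUNNING;    # already stopped
    # postgres crashed while master: nothing left to shut down cleanly,
    # so report the stop as successful and let Pacemaker promote a slave
    return OCF_SUCCESS if $monitor_rc == OCF_FAILED_MASTER;
    # any other unexpected status is still a failed stop
    return OCF_ERR_GENERIC
        unless $monitor_rc == OCF_SUCCESS or $monitor_rc == OCF_RUNNING_MASTER;
    return;    # running normally: the caller goes on with pg_ctl stop
}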

Check if a secondary is connected before demoting primary

In sub pgsql_demote, we need to make sure that at least one secondary is connected when demoting the primary in a switchover scenario.

If Pacemaker moves the master role to another node and demotes the old one to a slave, we should raise an error (a "soft" failure?) if the new master-to-be is not up to date with the current master.
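
A rough illustration of such a check (psql connection options omitted, the helper name is a placeholder):

# true if at least one secondary is currently streaming from this primary
sub _has_connected_secondary {
    chomp( my $count = qx{ psql -AtXc "SELECT count(*) FROM pg_stat_replication" } );
    return ( $count =~ /^\d+$/ and $count > 0 );
}

# in pgsql_demote, during a switchover, something like:
#   return $OCF_ERR_GENERIC unless _has_connected_secondary();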

Critical issue: agent ignores replication status from the master

Hi!
I've got a new issue.
While testing PAF, I randomly rebooted nodes and simulated two node failures. This means the cluster stops working, because only one master is alive while the configuration wants both a master and a slave.
This is the log from one slave:

2016-04-18 10:44:24 MSK LOG:  entering standby mode
2016-04-18 10:44:24 MSK LOG:  consistent recovery state reached at 37B/78000090
2016-04-18 10:44:24 MSK LOG:  record with zero length at 37B/78000090
2016-04-18 10:44:24 MSK LOG:  database system is ready to accept read only connections
2016-04-18 10:44:25 MSK LOG:  started streaming WAL from primary at 37B/78000000 on timeline 13
2016-04-18 10:44:25 MSK FATAL:  could not receive data from WAL stream: ERROR:  requested starting point 37B/78000000 is ahead of the WAL flush position of this server 37B/770AEBB8

2016-04-18 10:44:25 MSK LOG:  incomplete startup packet
2016-04-18 10:44:25 MSK LOG:  started streaming WAL from primary at 37B/78000000 on timeline 13
2016-04-18 10:44:25 MSK FATAL:  could not receive data from WAL stream: ERROR:  requested starting point 37B/78000000 is ahead of the WAL flush position of this server 37B/770AEBB8
.... (looping infinity) 

This is the log from another slave:

2016-04-18 10:59:45 MSK LOG:  entering standby mode
2016-04-18 10:59:45 MSK FATAL:  requested timeline 13 is not a child of this server's history
2016-04-18 10:59:45 MSK DETAIL:  Latest checkpoint is at 37B/730C6DE0 on timeline 9, but in the history of the requested timeline, the server forked off from that timeline at 37B/730C6DE0.
2016-04-18 10:59:45 MSK LOG:  startup process (PID 1295) exited with exit code 1
2016-04-18 10:59:45 MSK LOG:  aborting startup due to startup process failure

This is the master status:

root@a:~# crm_mon -1
Last updated: Mon Apr 18 11:10:49 2016
Last change: Mon Apr 18 10:56:16 2016 via crm_attribute on c.
Stack: corosync
Current DC: a. (1084754433) - partition with quorum
Version: 1.1.12-cdf310a
3 Nodes configured
7 Resources configured


Online: [ a. b. ]
OFFLINE: [ c. ]

 Resource Group: master
     pgsql-master-ip    (ocf::heartbeat:IPaddr2):   Started a. 
 Master/Slave Set: msPostgresql [pgsqld]
     Masters: [ a. ]
     Slaves: [ b. ]
     Stopped: [ c. ]
 Clone Set: WebFarm [apache]
     Started: [ a. b. ]
     Stopped: [ c. ]
root@a:~# sudo -u postgres psql -h 192.168.10.200 -P pager=off -q -A -c "SELECT client_addr,sync_state from pg_stat_replication"
could not change directory to "/root": Permission denied
client_addr|sync_state
(0 rows)
root@a:~# 

I rebooted the other slave to restore its database, but slave b is not part of the replication due to this infinite recovery loop, and it is not marked as failed by PAF.

I think this is a critical issue. If a slave drops out of the master's replication (pg_stat_replication), PAF should wait some time (the Slave role monitor timeout, maybe) and then mark this slave as failed.

Use env vars instead of "crm_node"

In sub pgsql_promote, instead of calling crm_node, could we use the available environment variables, like OCF_RESKEY_CRM_meta_notify_slave_uname and OCF_RESKEY_CRM_meta_notify_start_uname?
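
For illustration (a sketch, not the agent's code), the node list could be rebuilt from those variables:

# space-separated node names exported by Pacemaker for notify actions
my @slaves  = split /\s+/, ( $ENV{OCF_RESKEY_CRM_meta_notify_slave_uname} // '' );
my @started = split /\s+/, ( $ENV{OCF_RESKEY_CRM_meta_notify_start_uname} // '' );
my %seen;
my @nodes = grep { length and not $seen{$_}++ } ( @slaves, @started );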

Perl warning

There's a Perl warning in the logs when the score does not exist yet.

Around line 614, in check_location, the sub _get_master_score returns an empty string when the score does not exist. This raises the warning during the next tests related to $node_score:

_set_master_score( -1, $row->[0] ) if $node_score != -1;
[...]
if ( $node_score != $row->[1] ) {

Same issue as in e4d98a4.
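
One possible behavior-preserving fix, sketched: short-circuit on the empty string before the numeric comparisons, so Perl no longer warns, e.g.:

# an empty string means the master score has never been set for this node
_set_master_score( -1, $row->[0] ) if $node_score eq '' or $node_score != -1;
[...]
if ( $node_score eq '' or $node_score != $row->[1] ) {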

Update documentation

We need to update the documentation before releasing 2.0.

  • how the master is chosen during the very first cluster startup
  • how PAF checks that a switchover is safe
  • check the Installation manual
  • check the Configuration manual
  • check the Administration manual
  • update the link to the RPM/DEB file everywhere

psql: could not connect to server: No such file or directory

Hello

We don't use the default parameters for our database,
so we have set system_user, pgdata and pghost on our resource.

All the pg_xxx commands work, but the psql call in the _confirm_role function doesn't. It seems the parameters aren't used (see below).

# pcs resource debug-start pgsqlms

Operation start for pgsqlms:0 (ocf:heartbeat:pgsqlms) returned 0
 >  stdout: /DBTEST/tmp:32100 - no response
 >  stdout: pg_ctl: no server running
 >  stdout: waiting for server to start....2016-05-25 15:16:00 CEST [28545]: [1-1] user=,db= LOG:  redirecting log output to logging collector process
 >  stdout: 2016-05-25 15:16:00 CEST [28545]: [2-1] user=,db= HINT:  Future log output will appear in directory "/DBTEST/log/tech".
 >  stdout:  done
 >  stdout: server started
 >  stdout: /DBTEST/tmp:32100 - accepting connections
 >  stderr: Use of uninitialized value $postgres_gid in concatenation (.) or string at /usr/lib/ocf/resource.d/heartbeat/pgsqlms line 88.
 >  stderr: Use of uninitialized value $postgres_gid in concatenation (.) or string at /usr/lib/ocf/resource.d/heartbeat/pgsqlms line 88.
 >  stderr: pgsqlms(pgsqlms:0)[28530]: 2016/05/25_15:16:00  DEBUG: _runas: launching as "dbtest" command "/usr/pgsql-9.4/bin/pg_isready -h /DBTEST/tmp -p 32100"
 >  stderr: Use of uninitialized value $( in scalar assignment at /usr/lib/ocf/resource.d/heartbeat/pgsqlms line 92.
 >  stderr: pgsqlms(pgsqlms:0)[28530]: 2016/05/25_15:16:00  DEBUG: pgsql_monitor: instance "pgsqlms:0" is not listening
 >  stderr: Use of uninitialized value $postgres_gid in concatenation (.) or string at /usr/lib/ocf/resource.d/heartbeat/pgsqlms line 88.
 >  stderr: pgsqlms(pgsqlms:0)[28530]: 2016/05/25_15:16:00  DEBUG: _runas: launching as "dbtest" command "/usr/pgsql-9.4/bin/pg_ctl -D /DBTEST/base/system status"
 >  stderr: Use of uninitialized value $postgres_gid in concatenation (.) or string at /usr/lib/ocf/resource.d/heartbeat/pgsqlms line 88.
 >  stderr: Use of uninitialized value $( in scalar assignment at /usr/lib/ocf/resource.d/heartbeat/pgsqlms line 92.
 >  stderr: pgsqlms(pgsqlms:0)[28530]: 2016/05/25_15:16:00  DEBUG: _confirm_stopped: no postmaster process found for instance "pgsqlms:0"
 >  stderr: pgsqlms(pgsqlms:0)[28530]: 2016/05/25_15:16:00  DEBUG: _controldata: instance "pgsqlms:0" state is "shut down in recovery"
 >  stderr: pgsqlms(pgsqlms:0)[28530]: 2016/05/25_15:16:00  DEBUG: _confirm_stopped: instance "pgsqlms:0" controldata indicates that the instance was propertly shut down
 >  stderr: pgsqlms(pgsqlms:0)[28530]: 2016/05/25_15:16:00  DEBUG: pgsql_start: instance "pgsqlms:0" is not running, starting it as a secondary
 >  stderr: pgsqlms(pgsqlms:0)[28530]: 2016/05/25_15:16:00  DEBUG: _create_recovery_conf: get replication configuration from the template file "/DBTEST/recovery.conf"
 >  stderr: pgsqlms(pgsqlms:0)[28530]: 2016/05/25_15:16:00  DEBUG: _create_recovery_conf: write the replication configuration to "/DBTEST/base/system/recovery.conf" file
 >  stderr: pgsqlms(pgsqlms:0)[28530]: 2016/05/25_15:16:00  DEBUG: _runas: launching as "dbtest" command "/usr/pgsql-9.4/bin/pg_ctl -D /DBTEST/base/system -w start"
 >  stderr: Use of uninitialized value $postgres_gid in concatenation (.) or string at /usr/lib/ocf/resource.d/heartbeat/pgsqlms line 88.
 >  stderr: Use of uninitialized value $postgres_gid in concatenation (.) or string at /usr/lib/ocf/resource.d/heartbeat/pgsqlms line 88.
 >  stderr: Use of uninitialized value $( in scalar assignment at /usr/lib/ocf/resource.d/heartbeat/pgsqlms line 92.
 >  stderr: Use of uninitialized value $postgres_gid in concatenation (.) or string at /usr/lib/ocf/resource.d/heartbeat/pgsqlms line 88.
 >  stderr: pgsqlms(pgsqlms:0)[28530]: 2016/05/25_15:16:01  DEBUG: _runas: launching as "dbtest" command "/usr/pgsql-9.4/bin/pg_isready -h /DBTEST/tmp -p 32100"
 >  stderr: Use of uninitialized value $( in scalar assignment at /usr/lib/ocf/resource.d/heartbeat/pgsqlms line 92.
 >  stderr: pgsqlms(pgsqlms:0)[4827]: 2016/05/25_15:16:05  DEBUG: pgsql_monitor: instance "pgsqlms:0" is listening

 >  stderr: psql: could not connect to server: No such file or directory
 >  stderr:     Is the server running locally and accepting
 >  stderr:     connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?

 >  stderr: pgsqlms(pgsqlms:0)[4827]: 2016/05/25_15:16:05  DEBUG: _query: psql return code: 2
 >  stderr: pgsqlms(pgsqlms:0)[4827]: 2016/05/25_15:16:05  DEBUG: _query: @res:$VAR1 = [];
 >  stderr:
 >  stderr: 2016/05/25_15:16:05  ERROR: _confirm_role: psql could not connect to instance "pgsqlms:0"
 >  stderr: 2016/05/25_15:16:05  ERROR: pgsql_start: unexpected state for instance "pgsqlms:0" (returned 1)

Best regards,

How to set a timeout when restarting the Master role?

Hi!
I have a very large database spread across different countries. Due to bandwidth limitations, every operation has a 2-minute timeout.
I have an issue with very fast master switching: I need an option to set a timeout for switching the master.
The standard OCF agent supported this option:
op monitor interval=30s on-fail=restart role=Master timeout=160s
PAF does not support it. Can you suggest something?

promotion fails if LSN cannot be determined for other nodes

During promotion, the program attempts to determine whether this node has the highest LSN of all members in the partition. If the LSN of any node cannot be determined, an error is returned and promotion fails.

This can be problematic in two scenarios:

  1. Postgres failed on the old master but the node is still running because there is no STONITH.
  2. There is a node in the partition which is not running postgres.

In either case, retrieving the LSN from the node will fail and the promotion will fail.

If we can limit the check to nodes that have healthy postgres instances, that would be ideal. Otherwise it might make sense just to log this as a notice/warning and move on to the next node.
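
For instance, a sketch of the "skip and warn" option (the helper and variable names are assumptions, not the program's actual identifiers):

# look for the highest known LSN, ignoring nodes we cannot read it from
my $highest_lsn = $local_lsn;
foreach my $node ( @partition_nodes ) {
    my $node_lsn = _get_lsn_location( $node );    # undef when the attribute is missing
    if ( not defined $node_lsn ) {
        warn "pgsql_promote: no LSN found for node \"$node\", skipping it\n";
        next;
    }
    $highest_lsn = $node_lsn if $node_lsn > $highest_lsn;
}
# promote only if $local_lsn == $highest_lsn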

pgsqld_monitor_0 on <hostname> 'not configured'

Hi,

I followed the guide and documentation, but I am still unable to run the HA PostgreSQL setup on CentOS 7.2.

What can be the cause of:

pgsqld_monitor_0 on database1 'not configured' (6): call=18, status=complete, exitreason='none',
last-rc-change='Tue Nov 8 14:02:39 2016', queued=0ms, exec=148ms

My ocf:heartbeat:pgsqlms config:

pgsqld

pcs -f cluster1.xml resource create pgsqld ocf:heartbeat:pgsqlms
bindir=/usr/pgsql-9.6/bin pgdata=/var/lib/pgsql/9.6/data
op start timeout=60s
op stop timeout=60s
op promote timeout=30s
op demote timeout=120s
op monitor interval=15s timeout=10s role="Master"
op monitor interval=16s timeout=10s role="Slave"
op notify timeout=60s \

My "pcs status" output:

Cluster name: cluster_pgsql
Last updated: Tue Nov 8 14:02:44 2016 Last change: Tue Nov 8 14:02:37 2016 by root via cibadmin on database1
Stack: corosync
Current DC: database1 (version 1.1.13-10.el7_2.4-44eb2dd) - partition with quorum
3 nodes and 7 resources configured

Online: [ database1 database2 database3 ]

Full list of resources:

fence_vm_database1 (stonith:fence_rhevm): Started database2
fence_vm_database2 (stonith:fence_rhevm): Stopped
fence_vm_database3 (stonith:fence_rhevm): Stopped
Master/Slave Set: psql-ha [pgsqld]
pgsqld (ocf::heartbeat:pgsqlms): FAILED database1
pgsqld (ocf::heartbeat:pgsqlms): FAILED database2
pgsqld (ocf::heartbeat:pgsqlms): FAILED database3
pgsql-master-ip (ocf::heartbeat:IPaddr2): Stopped

Failed Actions:

  • pgsqld_monitor_0 on database1 'not configured' (6): call=18, status=complete, exitreason='none',
    last-rc-change='Tue Nov 8 14:02:39 2016', queued=0ms, exec=148ms
  • pgsqld_monitor_0 on database2 'not configured' (6): call=18, status=complete, exitreason='none',
    last-rc-change='Tue Nov 8 14:02:40 2016', queued=0ms, exec=297ms
  • pgsqld_monitor_0 on database3 'not configured' (6): call=18, status=complete, exitreason='none',
    last-rc-change='Tue Nov 8 14:02:40 2016', queued=0ms, exec=304ms

PCSD Status:
database1: Online
database2: Online
database3: Online

Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

Pacemaker resource timeout

In subs pgsql_demote and pgsql_stop, we set the pg_ctl stop timeout to ten minutes, in order to let the Pacemaker timeout kick in first.
It would be preferable to read the resource action timeout if possible, so we can be sure we set the pg_ctl timeout to a higher value.
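
For example (a sketch only; the 5-second margin is arbitrary), the agent could derive the pg_ctl timeout from the action timeout Pacemaker exports in milliseconds:

my $action_timeout = $ENV{OCF_RESKEY_CRM_meta_timeout};    # e.g. "60000"
my $pgctl_timeout  = 60 * 10;                              # current fallback: ten minutes
if ( defined $action_timeout and $action_timeout =~ /^\d+$/ ) {
    # slightly above the action timeout, so Pacemaker still gives up first
    $pgctl_timeout = int( $action_timeout / 1000 ) + 5;
}
# pg_ctl ... -w --timeout $pgctl_timeout stop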

Release 2.0 beta1

To be able to update the documentation, we need to create a beta version, publish packages and run all the required tests on virtualized 3-node clusters using these packages.

support controlled switchover for v1.x

Currently, branch v1.x does not support controlled switchover.

The lack of private attributes for stacks using Corosync 1.x or Pacemaker < 1.13 prevents us from backpatching the v2.x implementation of this feature to v1.x.

A recent discussion with the Pacemaker devs raised two possible tricks to deal with attributes outside of the CIB at the cluster level:

  • try to use attrd_updater with a very large dampening value
  • try to set the attribute on a non-existing node

See: https://www.mail-archive.com/[email protected]/msg03683.html

As v1.x will soon be in maintenance-only development, I am not setting a milestone on this issue. I'm not sure we'll work on it, but at least I want to keep track of it for now.

pgsqld failed in pacemaker with unknown error

Hi
I tried to run a cluster with PAF and received this error:
pgsqld_start_0 on srvc 'unknown error' (1): call=541, status=complete, exit-reason='none', last-rc-change='Mon Apr 4 12:33:16 2016', queued=0ms, exec=5128ms
Same errors on srva and srvb

PostgreSQL is not running and there are no startup attempts in the log files.
How can I debug PAF and find the reason for the error?
