Wrong IP added to pg_hba.conf when multiple interfaces exist.

In my specific case, the container has two interfaces. The IP from interface 1 was added to the pg_hba.conf - however connections seem to it from another container on the same network come from the secondary interface.

Oddly enough, the original "registration attempt" also is performed over the same network interface, so I'm confused as to why it decided to take the IP from the primary interface.

As an aside, there are many issues/inconveniences in the area of DNS resolution / pg_hba entries. For instance, given a setup of 3 containers: monitor, db1, db2: db1 and db2 register itself with monitor, db2 fails ot perform a pg_basebackup from db1 since db1 has 'db2' added in it's pg_hba.conf -- however somehow pg_basebackup tells me it can't resolve 'db1' to an IP whereas ping, dig, host have no problems in doing so.

Register an existing standby node

At the moment our command pg_autoctl create postgres knows how to register a pre-existing PostgreSQL instance when it's a primary server. This allows users to provision as usual and only then register to the monitor.

There are two main cases when registering an existing standby:

we only have a pre-existing PGDATA directory with a base backup copy ready for recovery and connection
the standby is already running and in recovery

In the first case we should check that the system_identifier is the same as in the primary server and then continue with the standby initialisation as usual, only skipping the pg_basebackup step.

In the second case, we should still check that the system_identifier is the same as in the primary server, and then verify that the setup is done according to pg_auto_failover expectations:

list all the setup that we do when creating a standby (replication slot, primary_conninfo, password, HBA, PostgreSQL configuration, pg_stat_wal_receiver, etc)
either implement a way to check that everything is in place in a way that allows us to implement a failover, or override the existing settings with the pg_auto_failover settings for a standby and restart the standby
register the standby to the monitor either as WAIT_STANDBY (if the primary is not ready yet, which is visible from the standby by the lack of an entry in pg_stat_wal_receiver) or in CATCHINGUP state

This is a brain dump, the actual implementation might differ to a large extend with the thinking here, though I think this is the general idea we should pursue here.

Release 1.0.0

These instructions assume you have $VERSION, $PROJECT, and $REPO environment variables set in your shell (e.g. 1.0.0, pgautofailover, and pg_auto_failover). With those set, code from most steps can be copy-pasted.

Prepare Project

Ensure all needed changes are in the relevant release-x.y branch. git log --cherry-pick --no-merges release-x.y...master can be helpful. Be sure to cherry-pick changes in the same order they were merged to the main branch (but do not cherry-pick merge commits themselves)
Add a CHANGELOG entry in the ~~master~~ release-x.y branch summarizing meaningful changes ( will merge release branch into master after releasing to update the changelog there )
Get complete approval for the commit you're tagging before creating or pushing any tags. Tags should be immutable, so do not proceed to the next step until you're sure you have everything you want in your release branch
Use git tag -a -s v$VERSION to create an annotated, signed tag for the release. Summarize the release in the one-line tag annotation (beneath 52 characters). Push the tag with git push origin v$VERSION
Visit the project's releases page (e.g. open https://github.com/citusdata/$REPO/releases)
- Create a new release object for your git tag (i.e. v$VERSION). Leave the description blank (it will auto-fill with the tag description)

As for signing tags, use your own key, and ensure it's known to GitHub.

Update OS Packages

BUG: Failed to parse pg_controldata output

[postgres@test1 ~]$ pg_autoctl create monitor --pgdata ./pgdata
10:59:54 INFO  Initialising a PostgreSQL cluster at "./pgdata"
10:59:54 INFO  Now using absolute pgdata value "/home/postgres/pgdata" in the configuration
10:59:54 INFO   /usr/pgsql-11/bin/pg_ctl --pgdata ./pgdata --options "-p 5432" --options "-h *" --wait start
10:59:54 ERROR Failed to parse pg_controldata output
10:59:54 FATAL Failed to discover PostgreSQL setup, please fix previous errors.
10:59:54 FATAL Failed to initialise a monitor node, see above for details

Shouldn't this be synchronous replication

Hi,

I just followed the README to setup the three nodes with PostgreSQL 12:

postgres@pg-af1:/u02/pgdata/12/PG1/ [PG1] pg_autoctl show state
Name |   Port | Group |  Node |     Current State |    Assigned State
---------------------------+--------+-------+-------+-------------------+------------------
pg-af1.it.dbi-services.com |   5432 |     0 |     1 |      wait_primary |      wait_primary
pg-af2.it.dbi-services.com |   5432 |     0 |     2 |        catchingup |        catchingup

What does the "catchingup" state really mean? I'd expected something like "in sync". When checking synchronous_standby_names I'd expected to see synchronous replication which apparently is not the case:

postgres@pg-af1:/u02/pgdata/12/PG1/ [PG1] sq
psql (12.0 dbi services build)
Type "help" for help.

postgres=# show synchronous_standby_names ;
synchronous_standby_names
---------------------------

(1 row)

Am I missing something? Replication in general is fine:

postgres=# select * from pg_stat_replication ;
-[ RECORD 1 ]----+------------------------------
pid              | 8336
usesysid         | 16385
usename          | pgautofailover_replicator
application_name | walreceiver
client_addr      | 192.168.22.71
client_hostname  | pg-af2.it.dbi-services.com
client_port      | 48680
backend_start    | 2019-10-19 15:39:00.508099+02
backend_xmin     |
state            | streaming
sent_lsn         | 0/30017A0
write_lsn        | 0/30017A0
flush_lsn        | 0/30017A0
replay_lsn       | 0/30017A0
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
reply_time       | 2019-10-19 15:49:16.594248+02

Thanks
Daniel

Feature Request: skip management of pg_hba.conf

I'm using Ansible to manage my PostgreSQL nodes and have to account for the line added by pg_auto_failover to pg_hba.conf in my Ansible config.

The line is also much more permissive than I'd prefer to allow.

Having an option that is along the lines of "--skip-pg_hba" that lets me manage that when using "create monitor" and "create postgres" would be ideal.

Change postgresql.conf settings on primary and standby via pg_autoctl

It would be useful if pg_autoctl had a command to change a setting like max_connections or shared_buffers when streaming replication is set up. We basically need to make sure that both servers have the setting in their postgresql.conf (or include) and then do a restart, preferably without triggering a failover.

about the func findHostnameLocalAddress addr protocol judgment

ifa_addr is empty in the pptp protocol,"ifaddr->ifa_addr->sa_family == AF_INET" causes a segmentation fault.

point in time backups

It'd be nice to have recommendations on how to do backups:

is it okay to edit pg_hba.conf of the primary node?
what's the best way to change wal replication for backups when the primary node changes?

"ERROR: permission denied for table node" using failover command

psql postgres://autoctl_node@monitor/pg_auto_failover
psql (11.3 (Ubuntu 11.3-1.pgdg18.04+1))
Type "help" for help.

pg_auto_failover=> select pgautofailover.perform_failover(formation_id => 'default', group_id => 0);
ERROR:  permission denied for table node
CONTEXT:  SQL statement "UPDATE pgautofailover.node SET goalstate = $1, statechangetime = now() WHERE nodename = $2 AND nodeport = $3"
pg_auto_failover=>

pg_autoctl failed to create monitor on mac

I was unable to create monitor using 1.0.4 release with the following command

pg_autoctl create monitor --pgdata ${WORK_DIR}/monitor --pgport 6000 --nodename localhost

It failed with

 LOG:  syntax error in file "/Users/mtuncer/dev/ha/test/monitor/postgresql-auto-failover.conf" line 4, near end of line
 FATAL:  configuration file "/Users/mtuncer/dev/ha/test/monitor/postgresql.conf" contains errors

when I checked postgresql-auto-failover.conf failure was port setting file with no value set.

port =

I checked monitor/pg_autoctl.cfg that is stored in temporary cache. Port setting was correctly applied. We just failed to move that setting into postgresql-auto-failover.conf

I tested this with ubuntu, it was fine.
I also testhed v1.0.3 with mac which is also fine. It must be something we introduced along 1.0.4

ERROR: duplicate key value violates unique constraint "node_nodename_nodeport_key"

I'm running the monitor and the master in two Docker container on a container orchestrator. The first deployment works great. The master registers itself on the monitor node. But as soon as I kill the master node container and start it again, it can't register itself anymore and gives me this error.

16:47:05 ERROR Failed to execute "SELECT * FROM pgautofailover.register_node($1, $2, $3, $4, $5, $6::pgautofailover.replication_state, $7)": ERROR:  duplicate key value violates unique constraint "node_nodename_nodeport_key"
DETAIL:  Key (nodename, nodeport)=(_mymasternodename_, 5432) already exists.
CONTEXT:  SQL statement "INSERT INTO pgautofailover.node (formationid, groupid, nodename, nodeport, goalstate, reportedstate) VALUES ($1, $2, $3, $4, $5, $6) RETURNING nodeid"

I start my node with the following command :
su postgres -c 'pg_autoctl create postgres --pgctl /usr/lib/postgresql/11/bin/pg_ctl --pgdata /home/postgres/data --monitor postgres://autoctl_node@_mymonitornodename_:5432/pg_auto_failover --nodename _mymasternodename_'

Fail to transition to state "catching" up after a failover

Hello,

On Ubuntu 18 and Postgresql 11, I'm systematically unable to automatically perform a failover and have the old primary become the secondary.

For example, here I stopped Postgresql and pg_auto_failover. The secondary was then promoted. When I turn back on the old primary, it isn't able to become a secondary.

Oct 04 17:14:53 pg-test-2 systemd[1]: Started pg_auto_failover.
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  Managing PostgreSQL installation at "/postgresql/11/main"
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  Found a stale pidfile at "/tmp/pg_autoctl/postgresql/11/main/pg_autoctl.pid"
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 WARN  Removing the stale pid file "/tmp/pg_autoctl/postgresql/11/main/pg_autoctl.pid"
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  The version of extenstion "pgautofailover" is "1.0" on the monitor
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  pg_autoctl service is starting
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  Calling node_active for node default/13/0 with current state: primary, PostgreSQL is not running, sync_state is "", WAL delta is -1.
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  Postgres is not running, starting postgres
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO   /usr/lib/postgresql/11/bin/pg_ctl --pgdata /postgresql/11/main --options "-p 5432" --options "-h *" --wait start
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 WARN  PostgreSQL was not running, restarted with pid 2122
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  FSM transition from "primary" to "demoted": A failover occurred, no longer primary
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  Transition complete: current state is now "demoted"
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  Calling node_active for node default/13/0 with current state: demoted, PostgreSQL is not running, sync_state is "", WAL delta is -1.
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  FSM transition from "demoted" to "catchingup": A new primary is available. First, try to rewind. If that fails, do a pg_basebackup.
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  The primary node returned by the monitor is pg-test-1.c.hunter-io.internal:5432
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  Rewinding PostgreSQL to follow new primary pg-test-1.c.hunter-io.internal:5432
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 ERROR Connection to database failed: could not connect to server: No such file or directory
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]:         Is the server running locally and accepting
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]:         connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 ERROR Failed to get the postgresql.conf path from the local postgres server, see above for details
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 WARN  Failed to rewind demoted primary to standby, trying pg_basebackup instead
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  Initialising PostgreSQL as a hot standby
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  Target directory exists: "/postgresql/11/main", stopping PostgreSQL
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  pg_ctl: no server running
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  pg_ctl stop failed, but PostgreSQL is not running anyway
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO  Running /usr/lib/postgresql/11/bin/pg_basebackup -w -h pg-test-1.c.hunter-io.internal -p 5432 --pgdata /postgresql/11/backup -U pgautofailover_replicator --write-recovery-conf --max-rate 100M --wal-method=stream --slot pgautofailover_standby ...
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 17:14:56 INFO  pg_basebackup: initiating base backup, waiting for checkpoint to complete
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: pg_basebackup: checkpoint completed
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: pg_basebackup: write-ahead log start point: 0/53000028 on timeline 3
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: pg_basebackup: starting background WAL receiver
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]:     0/23816 kB (0%), 0/1 tablespace (/postgresql/11/backup/backup_label )
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 10345/23816 kB (43%), 0/1 tablespace (...tgresql/11/backup/base/13090/827)
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 23826/23826 kB (100%), 0/1 tablespace (...esql/11/backup/global/pg_control)
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 23826/23826 kB (100%), 1/1 tablespace
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: pg_basebackup: write-ahead log end point: 0/530000F8
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: pg_basebackup: waiting for background process to finish streaming ...
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: pg_basebackup: base backup completed
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 17:14:56 INFO  Postgres is not running, starting postgres
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 17:14:56 INFO   /usr/lib/postgresql/11/bin/pg_ctl --pgdata /postgresql/11/main --options "-p 5432" --options "-h *" --wait start
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 17:14:56 ERROR Failed to start PostgreSQL. pg_ctl start returned: 1
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 17:14:56 ERROR Failed to become standby server, see above for details
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 17:14:56 ERROR Failed to transition from state "demoted" to state "catchingup", see above.
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 17:14:56 ERROR Failed to transition to state "catchingup", retrying...

This problem can manually be fixed by removing the node an adding it back, but of course, that's not ideal :)

Any help on how to fix this is welcome! Thanks.

choose node manually

Is it possible to manually choose the node that will become primary or wait_primary?
thanks

Fix length string buffers need to be converted to dynamic buffers in client code

We have many uses of char buffer[BUFFER_LENGTH] and strcpy, sprintf to that buffer.
Although we haven't seen a problem yet. Using fixed size buffer to host dynamic length content and having unchecked access to these buffers is not safe.

We should use PQExpBuffer and appendPQExpBufferStr() functions to modify that buffer.

how to reload pg_hba.conf or restart postgresql ?

I already create a 3 nodes cluster and get pg_auto_failover.
but when I try to connect via URI , there's messages complain about pg_hba.conf

postgres@pg-slave:~$ psql -d postgres://pg-slave:5432,pg-primary:5432/postgres?target_session_attrs=read-write
psql: FATAL:  no pg_hba.conf entry for host "192.168.11.153", user "postgres", database "postgres", SSL off
postgres@pg-slave:~$

so I modify /var/lib/postgresql/11/main/pg_hba.conf (my pgdata path) using the command below

echo "host all postgres 192.168.11.0/24 trust" >> /var/lib/postgresql/11/main/pg_hba.conf
tail -5 /var/lib/postgresql/11/main/pg_hba.conf

and the pg_hba.conf looks like

host "postgres" "postgres" pg-primary trust # Auto-generated by pg_auto_failover
host all "pgautofailover_monitor" monitor trust # Auto-generated by pg_auto_failover
host replication "pgautofailover_replicator" pg-slave trust # Auto-generated by pg_auto_failover
host "postgres" "pgautofailover_replicator" pg-slave trust # Auto-generated by pg_auto_failover
host	all	postgres	192.168.11.0/24	trust

ok , now , in normal situation , I have to restart postgresql service to make new pg_hba.conf works
but I have no idea how to do so with pg_auto_failover .
there is a systemd config with pg_auto_failover , but looks like not iomplement into OS ? (or it did , but I find nothing related in /etc/systemd/system/multi-user.target.wants)

so , here is my question , how do I reload/restart postgresql to make new pg_hba.conf works ?
I already try pg_autoctl reload

2019-09-19 07:51:58 [administrator@pg-slave ~]$ sudo runuser -l postgres -c "pg_autoctl reload --pgdata /var/lib/postgresql/11/main"
07:52:15 INFO  Managing PostgreSQL installation at "/var/lib/postgresql/11/main"
2019-09-19 07:52:15 [administrator@pg-slave ~]$ sudo runuser -l postgres -c "psql -d postgres://pg-slave:5432,pg-primary:5432/postgres?target_session_attrs=read-write"
psql: FATAL:  no pg_hba.conf entry for host "192.168.11.153", user "postgres", database "postgres", SSL off
2019-09-19 07:52:24 [administrator@pg-slave ~]$

stop the keeper and run again

2019-09-19 07:53:22 [administrator@pg-slave ~]$ sudo runuser -l postgres -c "pg_autoctl run --pgdata /var/lib/postgresql/11/main &"
2019-09-19 07:53:32 [administrator@pg-slave ~]$ 07:53:32 INFO  Managing PostgreSQL installation at "/var/lib/postgresql/11/main"
07:53:32 INFO  Found a stale pidfile at "/tmp/pg_autoctl/var/lib/postgresql/11/main/pg_autoctl.pid"
07:53:32 WARN  Removing the stale pid file "/tmp/pg_autoctl/var/lib/postgresql/11/main/pg_autoctl.pid"
07:53:32 INFO  The version of extenstion "pgautofailover" is "1.0" on the monitor
07:53:32 INFO  pg_autoctl service is starting
07:53:32 INFO  Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.

2019-09-19 07:53:34 [administrator@pg-slave ~]$ sudo runuser -l postgres -c "psql -d postgres://pg-slave:5432,pg-primary:5432/postgres?target_session_attrs=read-write"07:53:37 INFO  Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.

psql: FATAL:  no pg_hba.conf entry for host "192.168.11.153", user "postgres", database "postgres", SSL off
.

please , any suggestions ?

using with citus data extension

Hi,

I would like to know if this extension is compatible with citus data extension. As far as I know, the citus extension only provide distribution of the data, and query, but not auto failover.

If it is compatible, is there a writeup somewhere I can learn how to do it?

Thank you

re-adding nodes after losing the monitor

Hello!
First of all thanks for the effort you guys put into this, also kudos for using pg_rewind, awesome stuff !

While testing, i came across the following scenario : say that you lose the monitor some reason (in my test i just deleted the container). Replication keeps working, no issues there, but when i bring a new monitor up, i can add the master by removing the state file and issuing a "pg_autoctl create postgres", but when i try to add the slave, i get :
" 08:52:55 ERROR pg_autoctl doesn't know how to register an already existing standby server at the moment ". Of course, i can add it by rebuilding the slave but i was wondering if theres an alternative way doing this that won't require slave rebuild.

Add notification when creating standby while primary is not running yet

If the primary has been created but pg_autoctl is not running, then pg_autoctl create on the standby will get stuck at:

03:33:33 INFO  FSM transition from "init" to "wait_standby": Start following a primary
03:33:33 INFO  Transition complete: current state is now "wait_standby"

It would be nice if we gave a hint to the user that pg_autoctl run needs to be running at the primary at this point.

slave node can't run pg_basebackup

slave/standby node postgres file lost, pg_autoctl can't startup postgres, pg_autoctl can't auto run pg_basebackup to rebuild it.

release 1.0.5

These instructions assume you have $VERSION, $PROJECT, and $REPO environment variables set in your shell (e.g. 1.0.5, pgautofailover, and pg_auto_failover). With those set, code from most steps can be copy-pasted.

Prepare Project

Ensure all needed changes are in the relevant release-x.y branch. git log --cherry-pick --no-merges release-x.y...master can be helpful. Be sure to cherry-pick changes in the same order they were merged to the main branch (but do not cherry-pick merge commits themselves)
Add a CHANGELOG entry in the ~~master~~ release-x.y branch summarizing meaningful changes ( will merge release branch into master after releasing to update the changelog there )
Update version number in defaults.h
Get complete approval for the commit you're tagging before creating or pushing any tags. Tags should be immutable, so do not proceed to the next step until you're sure you have everything you want in your release branch
Use git tag -a -s v$VERSION to create an annotated, signed tag for the release. Summarize the release in the one-line tag annotation (beneath 52 characters). Push the tag with git push origin v$VERSION
Visit the project's releases page (e.g. open https://github.com/citusdata/$REPO/releases)
- Create a new release object for your git tag (i.e. v$VERSION). Leave the description blank (it will auto-fill with the tag description)

As for signing tags, use your own key, and ensure it's known to GitHub.

Update OS Packages

Support mounted pgdata folder

When executing create postgres on secondary node it fails due to mounted PGDATA folder.

ERROR Failed to remove directory "/opt/postgressql/pgdbdata": Device or resource busy

Full log:

ubuntu@ip-10-61-5-136:~$ sudo -u postgres -i pg_autoctl create postgres  --pgdata /opt/postgressql/pgdbdata --monitor postgres://autoctl_node@MONITOR_NODE:5432/pg_auto_failover --nodename SECONDARY_NODE
15:13:01 INFO  Registered node SECONDARY_NODE with id 16 in formation "default", group 0.
15:13:01 INFO  Writing keeper init state file at "/var/lib/postgresql/.local/share/pg_autoctl/opt/postgressql/pgdbdata/pg_autoctl.init"
15:13:01 INFO  Successfully registered as "wait_standby" to the monitor.
15:13:01 INFO  FSM transition from "init" to "wait_standby": Start following a primary
15:13:01 INFO  Transition complete: current state is now "wait_standby"
15:13:06 INFO  FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
15:13:06 INFO  The primary node returned by the monitor is PRIMARY_NODE:5432
15:13:06 INFO  Initialising PostgreSQL as a hot standby
15:13:06 INFO  Target directory exists: "/opt/postgressql/pgdbdata", stopping PostgreSQL
15:13:06 INFO  pg_ctl: no server running

15:13:06 INFO  pg_ctl stop failed, but PostgreSQL is not running anyway
15:13:09 INFO  Running /usr/lib/postgresql/11/bin/pg_basebackup -w -h PRIMARY_NODE -p 5432 --pgdata /opt/postgressql/backup -U pgautofailover_replicator --write-recovery-conf --max-rate 100M --wal-method=stream --slot pgautofailover_standby ...
15:18:27 INFO  pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: CB/6D000028 on timeline 3
pg_basebackup: starting background WAL receiver
       0/30758969 kB (0%), 0/1 tablespace (.../postgressql/backup/backup_label)
....
30759029/30759029 kB (100%), 0/1 tablespace (...gressql/backup/global/pg_control)
30759029/30759029 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: CB/6D01E8C8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed

could not remove file or directory "/opt/postgressql/pgdbdata": Device or resource busy
15:18:27 ERROR Failed to remove directory "/opt/postgressql/pgdbdata": Device or resource busy
15:18:27 ERROR Failed initialise standby server, see above for details
15:18:27 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.

dual-DC/AZ setup

I'm wondering whether pg_auto_failover could be deployed across two datacenters or availability zones, provided they are close enough to still allow for sync rep.

AFAICT, pg_auto_failover allows only one standby per group, correct? Otherwise, one could maybe deploy the monitor and the primary in different DCs and have one standby in each. This way, the standby local to the monitor could be promoted if the other DC is down, or replication would just go on with the monitor DC down.

If one would deploy the standby in the same DC as the monitor, and that one would go down, the primary would be all alone. Right now, it seems to stay in sync rep (i.e. is blocked) until the other DC is up again, even though it knows that both its standby and the monitor are gone:

17:40:26 ERROR Failed to get node state for node 8 (10.0.3.140:55432) in group 0 of formation "default" with initial state "primary", replication state "", and replication lag -1, see previous lines for details
17:40:26 ERROR Failed to get the goal state from the monitor
17:40:26 WARN  Checking for network partitions...
17:40:31 ERROR PostgreSQL primary server has lost track of its standby: pg_stat_replication reports no client using the slot "pgautofailover_standby".

Would it be possible for the primary to march on and clear synchronous_standby_names, or would this be an architecture violation (not going through the FSM or so)? Any other suggestions on how to setup pg_auto_failover across two DCs?

pg_autoctl missing man page

We are getting packaging warnings about missing man page.
We should consider adding it.

Release 1.0.2

These instructions assume you have $VERSION, $PROJECT, and $REPO environment variables set in your shell (e.g. 1.0.0, pgautofailover, and pg_auto_failover). With those set, code from most steps can be copy-pasted.

Prepare Project

Ensure all needed changes are in the relevant release-x.y branch. git log --cherry-pick --no-merges release-x.y...master can be helpful. Be sure to cherry-pick changes in the same order they were merged to the main branch (but do not cherry-pick merge commits themselves)
Add a CHANGELOG entry in the ~~master~~ release-x.y branch summarizing meaningful changes ( will merge release branch into master after releasing to update the changelog there )
Update version number in defaults.h
Get complete approval for the commit you're tagging before creating or pushing any tags. Tags should be immutable, so do not proceed to the next step until you're sure you have everything you want in your release branch
Use git tag -a -s v$VERSION to create an annotated, signed tag for the release. Summarize the release in the one-line tag annotation (beneath 52 characters). Push the tag with git push origin v$VERSION
Visit the project's releases page (e.g. open https://github.com/citusdata/$REPO/releases)
- Create a new release object for your git tag (i.e. v$VERSION). Leave the description blank (it will auto-fill with the tag description)

As for signing tags, use your own key, and ensure it's known to GitHub.

Update OS Packages

Debian Buster support

Detected operating system as debian/buster.
Unfortunately, your operating system distribution and version are not supported by this script.

Since it's out for a few months now, it would be great if packages were built :)

monitor does not restart when running with pg_autoctl run

If the postgres of the monitor gets killed or crashes it is not correctly restarted by the 'keeper'.
This can be reproduced in 2 sessions:

Session 1

pg_autoctl run --pgdata /path/of/monitor

Session 2

kill <pid of monitor postgres>

(or a killall postgres for that matter)

pg_autoctl will be in an error loop but not restarting the postgres process. This is different behaviour than we have when running a postgres node.

pg_autoctl enable|disable maintenance

It would be very convenient for me to be able to put a node into maintenance mode from that node's machine. Main use case is preparation for a minor PG upgrade.

packages for postgresql 12 ?

I am trying to upgrade postgresql to postgresql-12 , but from the citusdata apt repository , there is only postgresql-11-auto-failover-1.0 , and with the package , postgresql-11 will be installed.
so there are two versions of postggresql in one node

2019-10-23 07:01:07 [administrator@pg_primary ~]$ sudo apt list --installed|grep postgresql

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

postgresql-11/bionic-pgdg,now 11.5-3.pgdg18.04+1 amd64 [installed,automatic]
postgresql-11-auto-failover-1.0/bionic,now 1.0.5 amd64 [installed]
postgresql-12/bionic-pgdg,now 12.0-2.pgdg18.04+1 amd64 [installed]
postgresql-client-11/bionic-pgdg,now 11.5-3.pgdg18.04+1 amd64 [installed,automatic]
postgresql-client-12/bionic-pgdg,now 12.0-2.pgdg18.04+1 amd64 [installed]
postgresql-client-common/bionic-pgdg,now 207.pgdg18.04+1 all [installed,automatic]
postgresql-common/bionic-pgdg,now 207.pgdg18.04+1 all [installed,automatic]

even I run the postgresql only in version 12.

2019-10-23 07:08:59 [administrator@pg_primary ~]$ sudo runuser -l postgres -c "psql postgres://pg_secondary:5432,pg_primary:5432/testdb1?target_session_attrs=read-write -c 'show server_version;'"
          server_version          
----------------------------------
 12.0 (Ubuntu 12.0-2.pgdg18.04+1)
(1 row)

hope the offical packages will been released soon

Unable to create postgres on standby machine

getting below error when trying to setup secondary instance :

[postgres@lxlab1569252301 postgres_log]$ pg_autoctl create postgres --pgdata /postgres_data --monitor postgres://[email protected]:50001/pg_auto_failover
14:10:38 INFO  Using --nodename "lxlab1569252301.devlab.lowes.com", which resolves to IP address "172.29.142.23"
14:10:38 INFO  Continuing from a previous `pg_autoctl create` failed attempt
14:10:38 INFO  PostgreSQL state at registration time was: PGDATA does not exists
14:10:38 INFO  FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
14:10:38 INFO  The primary node returned by the monitor is lxlab1569252313.devlab.lowes.com:50001
14:10:38 INFO  Initialising PostgreSQL as a hot standby
14:10:38 INFO  Target directory exists: "/postgres_data", stopping PostgreSQL
14:10:38 INFO  pg_ctl: no server running

14:10:38 INFO  pg_ctl stop failed, but PostgreSQL is not running anyway
could not remove file or directory "//backup": Permission denied
14:10:38 ERROR Failed to remove directory "//backup": Permission denied
14:10:38 ERROR Failed initialise standby server, see above for details
14:10:38 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.

i dont know from its checking /backup folder .

successfully created monitor and primary instance.
Version is 11.2 and Os types is linux x86_64.
Followed below link for setup
https://github.com/citusdata/pg_auto_failover

no auto start after reboot with default systemd configuration ?

OS: ubuntu 18.04
pg_autofailover: 1.0.4

I follow the document to create a systemd file using command

pg_autoctl -q show systemd --pgdata "/var/lib/postgresql/11/main" | sudo tee /etc/systemd/system/pgautofailover.service

and the pgautofailover.service

2019-09-20 16:52:55 [administrator@pg-slave ~]$ sudo cat /etc/systemd/system/pgautofailover.service 
[Unit]
Description = pg_auto_failover

[Service]
WorkingDirectory = /var/lib/postgresql/11/main
Environment = 'PGDATA=/var/lib/postgresql/11/main'
User = postgres
ExecStart = /usr/bin/pg_autoctl run
Restart = always
StartLimitBurst = 0

[Install]
WantedBy = multi-user.target

I can control pg_autofailover manually with systemd , like

sudo systemctl stop pgautofailover
sudo systemctl start pgautofailover

but when the node goes to reboot , pgautofailover not automatically start
I have to login to node and do sudo systemctl start pgautofailover

any suggestions ?

warnings during manpage build needs fixing

I got following warnings when building man pages for 1.0.5 release

   4 /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/index.rst:7: WARNING: Title underline too short.
   4 /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/install.rst:115: WARNING: Error in "code-block" directive:
   4 /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/operations.rst:87: WARNING: Unexpected indentation.
   3 /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/operations.rst:93: WARNING: Block quote ends without a blank line; unexpected unindent.
   4 /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/quickstart.rst:20: WARNING: Title underline too short.
   1 looking for now-outdated files... /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/operations.rst:93: WARNING: Block quote ends without a blank line; unexpected unindent.
   4 pg_autoctl.1 { } /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/ref/reference.rst:356: WARNING: undefined label: _pg_autoctl_create_monitor (if the link has no caption the label must precede a section header)
   4 writing... pg_auto_failover.1 { intro quickstart quickstart-cluster architecture fault-tolerance install ref/reference ref/configuration operations fsm } /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/ref/reference.rst:356: WARNING: undefined label: _pg_autoctl_create_monitor (if the link has no caption the label must precede a section header)

Logging it here to be fixed for the next version

Permissions missing in monitor database

Hi,

is this a known issue?

postgres@pg-af3:/home/postgres/ [af] pg_autoctl show state
Name |   Port | Group |  Node |     Current State |    Assigned State
---------------------------+--------+-------+-------+-------------------+------------------
pg-af1.it.dbi-services.com |   5432 |     0 |     1 |           primary |           primary
pg-af2.it.dbi-services.com |   5432 |     0 |     2 |         secondary |         secondary

17:29:11 postgres@pg-af3:/home/postgres/ [af] psql postgres://autoctl@pg-af3:5432/pg_auto_failover
psql (12.0 dbi services build)
Type "help" for help.

pg_auto_failover=> select pgautofailover.perform_failover(formation_id => 'default', group_id => 0);
ERROR:  permission denied for schema pgautofailover
LINE 1: select pgautofailover.perform_failover(formation_id => 'defa...
^

PostgreSQL 12 on CentOS 8.
pg_autoctl version 1.0.5

Not an issue to fix it manually but I still wonder why permissions are missing in the monitor database. Did I miss something?

Best regards
Daniel

pg_hba.conf error when creating secondary if reverse DNS is unavailable

I tried setting up pg_auto_failover on a CentOS 7 Azure VM (locally), but when setting up the standby I ran into the following issue:

$ pg_autoctl create postgres --monitor postgres://autoctl_node@pgaf-test.dmqu0iik0f1uppllrfdl2ayvwe.bx.internal.cloudapp.net:6000/pg_auto_failover --pgport 6002 --nodename `hostname --fqdn`  --pgdata node2   
...
03:39:15 ERROR pg_basebackup: could not connect to server: FATAL:  no pg_hba.conf entry for replication connection from host "10.1.19.4", user "pgautofailover_replicator", SSL off

03:39:15 ERROR Failed to run pg_basebackup: exit code 1
03:39:15 ERROR Failed initialise standby server, see above for details
03:39:15 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.

In the pg_hba.conf, the following entries were added:

host all "pgautofailover_monitor" pgaf-test.dmqu0iik0f1uppllrfdl2ayvwe.bx.internal.cloudapp.net trust # Auto-generated by pg_auto_failover
host replication "pgautofailover_replicator" pgaf-test.dmqu0iik0f1uppllrfdl2ayvwe.bx.internal.cloudapp.net trust # Auto-generated by pg_auto_failover
host "postgres" "pgautofailover_replicator" pgaf-test.dmqu0iik0f1uppllrfdl2ayvwe.bx.internal.cloudapp.net trust # Auto-generated by pg_auto_failover

Using hostnames in pg_hba.conf relies on reverse DNS of the client IP. However, reverse DNS of internal IPs does not work on Azure.

Adding pg_hba.conf records for 10.1.19.4/32 on the primary and running SELECT pg_reload_conf() before creating the standby resolved the problem.

Build error against PostgreSQL 12

Hi,

pg_auto_failover cannot be built against PostgreSQL 12:

+ PG_CONFIG=/usr/pgsql-12/bin/pg_config
+ /usr/bin/make -j3
ipaddr.c: In function 'findHostnameFromLocalIpAddress':
ipaddr.c:548:12: warning: unused variable 'address_len' [-Wunused-variable]
  socklen_t address_len;
            ^~~~~~~~~~~
ipaddr.c:547:26: warning: unused variable 'address' [-Wunused-variable]
  struct sockaddr_storage address;
                          ^~~~~~~
health_check_worker.c: In function 'HealthCheckWorkerLauncherMain':
health_check_worker.c:233:20: error: implicit declaration of function 'AllocSetContextCreateExtended'; did you mean 'AllocSetContextCreateInternal'? [-Werror=implicit-function-declaration]
  launcherContext = AllocSetContextCreateExtended(CurrentMemoryContext,
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                    AllocSetContextCreateInternal
health_check_worker.c:233:18: error: assignment to 'MemoryContext' {aka 'struct MemoryContextData *'} from 'int' makes pointer from integer without a cast [-Werror=int-conversion]
  launcherContext = AllocSetContextCreateExtended(CurrentMemoryContext,
                  ^
health_check_worker.c: In function 'BuildDatabaseList':
health_check_worker.c:363:9: error: implicit declaration of function 'heap_beginscan_catalog'; did you mean 'table_beginscan_catalog'? [-Werror=implicit-function-declaration]
  scan = heap_beginscan_catalog(pgDatabaseRelation, 0, NULL);
         ^~~~~~~~~~~~~~~~~~~~~~
         table_beginscan_catalog
health_check_worker.c:363:7: error: assignment to 'HeapScanDesc' {aka 'struct HeapScanDescData *'} from 'int' makes pointer from integer without a cast [-Werror=int-conversion]
  scan = heap_beginscan_catalog(pgDatabaseRelation, 0, NULL);
       ^
In file included from /usr/pgsql-12/include/server/postgres.h:46,
                 from health_check_worker.c:13:
health_check_worker.c:365:49: error: passing argument 1 of 'heap_getnext' from incompatible pointer type [-Werror=incompatible-pointer-types]
  while (HeapTupleIsValid(dbTuple = heap_getnext(scan, ForwardScanDirection)))
                                                 ^~~~
/usr/pgsql-12/include/server/c.h:626:48: note: in definition of macro 'PointerIsValid'
 #define PointerIsValid(pointer) ((const void*)(pointer) != NULL)
                                                ^~~~~~~
health_check_worker.c:365:9: note: in expansion of macro 'HeapTupleIsValid'
  while (HeapTupleIsValid(dbTuple = heap_getnext(scan, ForwardScanDirection)))
         ^~~~~~~~~~~~~~~~
In file included from health_check_worker.c:21:
/usr/pgsql-12/include/server/access/heapam.h:120:45: note: expected 'TableScanDesc' {aka 'struct TableScanDescData *'} but argument is of type 'HeapScanDesc' {aka 'struct HeapScanDescData *'}
 extern HeapTuple heap_getnext(TableScanDesc scan, ScanDirection direction);
                               ~~~~~~~~~~~~~~^~~~
cli_config.c: In function 'cli_config_check':
cli_config.c:82:6: warning: unused variable 'errors' [-Wunused-variable]
  int errors = 0;
      ^~~~~~
health_check_worker.c:378:19: error: implicit declaration of function 'HeapTupleGetOid'; did you mean 'HeapTupleIsValid'? [-Werror=implicit-function-declaration]
    entry->dboid = HeapTupleGetOid(dbTuple);
                   ^~~~~~~~~~~~~~~
                   HeapTupleIsValid
health_check_worker.c:387:15: error: passing argument 1 of 'heap_endscan' from incompatible pointer type [-Werror=incompatible-pointer-types]
  heap_endscan(scan);
               ^~~~
In file included from health_check_worker.c:21:
/usr/pgsql-12/include/server/access/heapam.h:119:40: note: expected 'TableScanDesc' {aka 'struct TableScanDescData *'} but argument is of type 'HeapScanDesc' {aka 'struct HeapScanDescData *'}
 extern void heap_endscan(TableScanDesc scan);
                          ~~~~~~~~~~~~~~^~~~
health_check_worker.c: In function 'HealthCheckWorkerMain':
health_check_worker.c:452:21: error: assignment to 'MemoryContext' {aka 'struct MemoryContextData *'} from 'int' makes pointer from integer without a cast [-Werror=int-conversion]
  healthCheckContext = AllocSetContextCreateExtended(CurrentMemoryContext,
                     ^
cli_service.c: In function 'cli_monitor_run':
cli_service.c:161:16: warning: unused variable 'existingPgSetup' [-Wunused-variable]
  PostgresSetup existingPgSetup = { 0 };
                ^~~~~~~~~~~~~~~
cli_service.c:155:7: warning: unused variable 'configFilePath' [-Wunused-variable]
  char configFilePath[MAXPGPATH];
       ^~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[2]: *** [<builtin>: health_check_worker.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [Makefile:16: monitor] Error 2
make[1]: *** Waiting for unfinished jobs....
cli_show.c: In function 'cli_show_uri':
cli_show.c:309:10: warning: unused variable 'monitor' [-Wunused-variable]
  Monitor monitor = { 0 };
          ^~~~~~~
cli_create_drop_node.c: In function 'cli_create_monitor':
cli_create_drop_node.c:427:7: warning: unused variable 'ipAddr' [-Wunused-variable]
  char ipAddr[BUFSIZE];
       ^~~~~~
cli_create_drop_node.c:425:16: warning: unused variable 'pgSetup' [-Wunused-variable]
  PostgresSetup pgSetup = config.pgSetup;
                ^~~~~~~
cli_root.c: In function 'root_options':
cli_root.c:114:6: warning: unused variable 'debugCount' [-Wunused-variable]
  int debugCount = 0;
      ^~~~~~~~~~
pgsetup.c: In function 'get_pgpid':
pgsetup.c:403:7: warning: unused variable 'line' [-Wunused-variable]
  char line[BUFSIZE];
       ^~~~
pgsetup.c:402:6: warning: unused variable 'lineno' [-Wunused-variable]
  int lineno;
      ^~~~~~
cli_do_root.c: In function 'keeper_cli_keeper_setup_getopts':
cli_do_root.c:231:23: warning: unused variable 'errors' [-Wunused-variable]
  int c, option_index, errors = 0;
                       ^~~~~~
cli_do_root.c:231:9: warning: unused variable 'option_index' [-Wunused-variable]
  int c, option_index, errors = 0;
         ^~~~~~~~~~~~
cli_do_root.c:231:6: warning: unused variable 'c' [-Wunused-variable]
  int c, option_index, errors = 0;
      ^
monitor_config.c: In function 'monitor_config_set_setting':
monitor_config.c:442:13: warning: unused variable 'option' [-Wunused-variable]
  IniOption *option;
             ^~~~~~
monitor_pg_init.c: In function 'monitor_pg_init':
monitor_pg_init.c:60:7: warning: unused variable 'postgresInstanceExists' [-Wunused-variable]
  bool postgresInstanceExists = pg_setup_pgdata_exists(&pgSetup);
       ^~~~~~~~~~~~~~~~~~~~~~

Can you please take a look?

Thanks, Devrim

Failure to create standby node on PGDG installed Postgres 11 on Ubuntu

Ubuntu 16.04, PostgreSQL 11.5 installed from PGDG

It appears that pg_autoctl doesn't know about the separation of configuration in the PGDG packages where config files are in /etc/postgresql/[VERSION]/[CLUSTER] instead of in PGDATA location and the restart doesn't use the right values.

Output:

postgres@appdb-server2:~$ pg_autoctl create postgres --pgport 5432 --monitor postgres://autoctl_node@appdb-monitor1/pg_auto_failover --nodename appdb-server2.testing.aweberint.com --dbname app --formation appdb
15:53:15 INFO  Continuing from a previous `pg_autoctl create` failed attempt
15:53:15 INFO  PostgreSQL state at registration time was: PostgreSQL is running and a primary server
15:53:15 INFO  FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
15:53:15 INFO  The primary node returned by the monitor is appdb-server1.testing.aweberint.com:5432
15:53:15 INFO  Initialising PostgreSQL as a hot standby
15:53:15 INFO  Target directory exists: "/var/lib/postgresql/11/appdb", stopping PostgreSQL
15:53:20 INFO  Running /usr/bin/pg_basebackup -w -h appdb-server1.testing.aweberint.com -p 5432 --pgdata /var/lib/postgresql/11/backup -U pgautofailover_replicator --write-recovery-conf --max-rate 100M --wal-method=stream --slot pgautofailover_standby ...
15:53:21 INFO  pg_basebackup: initiating base backup, waiting for checkpoint to complete
WARNING:  skipping special file "./postgresql.conf"
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/F000028 on timeline 1
WARNING:  skipping special file "./pg_hba.conf"
pg_basebackup: starting background WAL receiver
    0/39116 kB (0%), 0/1 tablespace (...ostgresql/11/backup/backup_label)
WARNING:  skipping special file "./postgresql.conf"
WARNING:  skipping special file "./pg_hba.conf"
31220/39116 kB (79%), 0/1 tablespace (...gresql/11/backup/base/13086/2662)
39126/39126 kB (100%), 0/1 tablespace (...esql/11/backup/global/pg_control)
39126/39126 kB (100%), 1/1 tablespace                                         
pg_basebackup: write-ahead log end point: 0/F0000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed

15:53:21 INFO  Postgres is not running, starting postgres
15:53:21 INFO   /usr/bin/pg_ctl --pgdata /var/lib/postgresql/11/appdb --options "-p 5432" --options "-h *" --wait start
15:53:21 ERROR Failed to start PostgreSQL. pg_ctl start returned: 1
15:53:21 ERROR waiting for server to start....postgres: could not access the server configuration file "/var/lib/postgresql/11/appdb/postgresql.conf": No such file or directory
 stopped waiting

15:53:21 ERROR Failed initialise standby server, see above for details
15:53:21 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.

If I manually start Postgres after this, replication works, but I can't figure out how to get pg_autoctl to know the machine is a standby and replicating as it hangs:

postgres@appdb-server2:~$ pg_autoctl run
16:04:41 INFO  Managing PostgreSQL installation at "/var/lib/postgresql/11/appdb"
16:04:41 WARN  The `pg_autoctl create` did not complete, completing now.
16:04:41 INFO  Continuing from a previous `pg_autoctl create` failed attempt
16:04:41 INFO  PostgreSQL state at registration time was: PostgreSQL is running and a primary server
16:04:41 INFO  FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
16:04:41 ERROR Failed to execute "SELECT * FROM pgautofailover.get_primary($1, $2)": ERROR:  schema "pgautofailover" does not exist
LINE 1: SELECT * FROM pgautofailover.get_primary($1, $2)
                      ^

16:04:41 ERROR Failed to get the primary node in the HA group from the monitor while running "SELECT * FROM pgautofailover.get_primary($1, $2)" with formation "" and group ID 0
16:04:41 ERROR Failed to initialise standby because get the primary node from the monitor failed, see above for details
16:04:41 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.
16:04:41 FATAL Failed to initialize pg_auto_failover service, see above for details

I could probably recover from this if I knew what manual steps to take.

pg_hba.conf issue adding a worker or coordinator by IP address

Problem occurs when you create a worker or coordinator with a nodename of an IP address like this:

pg_autoctl create worker --nodename 10.0.0.20

pg_hba.conf entries don't have the /32 suffix added to indicate that they're single-IP CIDR blocks, and the pg_hba.conf is not valid.

README Please provide alternate recipe for adding repos

curl https://install.citusdata.com/community/deb.sh | sudo bash

Encouraging people to blindly execute content from the internet as root is as dangerous as it is convenient - on both security grounds and failure modes. Please provide alternate default instructions to add your repos.

(I hasten to add that this project looks awesome and I’ll be looking to see if we can drop some homegrown equivalents - thank you for opening this!)

how do I connect to database from my application or psql ?

I follow the readme to build a monito-primary-secondary cluster
now replication is working , and when I shutdown primary, the secondary server will take over , and when the old primary comes back , it will join to the cluster as secondary , and the original secondary will promote to primary server , that's perfect !!
but I don't know to to connect to the database ??
I think it should not be connect via primary ip or secondary ip , if so , when the node goes down , the connection will fail.
in readme , it says connect URI is somewhing like

postgres://autoctl_node@monitor:5432/pg_auto_failover

yes , I can connect to the URI , but via that URI , I can not connect to the database which I just created in cluster.

so , what URI should I use in application or psql connect string ?

Design decisions regarding to fault-tolerance

Hello everyone!
Thanks for the interesting extension and adjacent tool (pg_autoctl utility).

I would like to know how cluster deals with different faults. Especially:

What happens if master node appears in isolation from monitor and other node? Will there be multimaster layout and as consequence split brain?
Is the monitor a single point of failure? What happens if monitor node falls?

health_check_worker.c: WaitForEvent(List *healthCheckList) uses non-crosscompatible machinery

WaitForEvent function heavily dependant on poll.h include, but in all other places, PostgreSQL uses such include tecnhique:

#ifdef HAVE_POLL_H
#include <poll.h>
#endif

If we do define include in such a manner we got:

pasha@PG480 MINGW64 /src/pg_auto_failover
$ make
make -C src/monitor/ all
make[1]: Entering directory '/src/pg_auto_failover/src/monitor'
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -O2  -std=c99 -Wall -Werror -Wno-unused-parameter -Iinclude -IC:/Users/pasha/Code/POSTGR~1/msys64/usr/local/POSTGR~1/include -I. -I./ -IC:/Users/pasha/Code/POSTGR~1/msys64/usr/local/POSTGR~1/include/server -IC:/Users/pasha/Code/POSTGR~1/msys64/usr/local/POSTGR~1/include/internal  -I./src/include/port/win32 -DEXEC_BACKEND  -IC:/Users/pasha/Code/POSTGR~1/msys64/usr/local/POSTGR~1/include/server/port/win32 -DWIN32_STACK_RLIMIT=4194304  -c -o health_check_worker.o health_check_worker.c
health_check_worker.c: In function 'HealthCheckWorkerLauncherMain':
health_check_worker.c:292:44: error: passing argument 2 of 'WaitForBackgroundWorkerStartup' from incompatible pointer type [-Werror=incompatible-pointer-types]
  292 |     WaitForBackgroundWorkerStartup(handle, &pid);
      |                                            ^~~~
      |                                            |
      |                                            int *
In file included from health_check_worker.c:28:
C:/Users/pasha/Code/POSTGR~1/msys64/usr/local/POSTGR~1/include/server/postmaster/bgworker.h:123:94: note: expected 'pid_t *' {aka 'long long int *'} but argument is of type 'int *'
  123 | extern BgwHandleStatus WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle, pid_t *pid);
      |                                                                                       ~~~~~~~^~~
health_check_worker.c: In function 'WaitForEvent':
health_check_worker.c:637:64: error: invalid application of 'sizeof' to incomplete type 'struct pollfd'
  637 |  pollFDs = (struct pollfd *) palloc0(healthCheckCount * sizeof(struct pollfd));
      |                                                                ^~~~~~
health_check_worker.c:644:47: error: invalid use of undefined type 'struct pollfd'
  644 |   struct pollfd *pollFileDescriptor = &pollFDs[healthCheckIndex];
      |                                               ^
health_check_worker.c:644:47: error: dereferencing pointer to incomplete type 'struct pollfd'
health_check_worker.c:670:21: error: 'POLLERR' undeclared (first use in this function)
  670 |     pollEventMask = POLLERR | POLLIN;
      |                     ^~~~~~~
health_check_worker.c:670:21: note: each undeclared identifier is reported only once for each function it appears in
health_check_worker.c:670:31: error: 'POLLIN' undeclared (first use in this function); did you mean 'LPOLELINK'?
  670 |     pollEventMask = POLLERR | POLLIN;
      |                               ^~~~~~
      |                               LPOLELINK
health_check_worker.c:674:31: error: 'POLLOUT' undeclared (first use in this function)
  674 |     pollEventMask = POLLERR | POLLOUT;
      |                               ^~~~~~~
health_check_worker.c:694:15: error: implicit declaration of function 'poll' [-Werror=implicit-function-declaration]
  694 |  pollResult = poll(pollFDs, healthCheckCount, pollTimeout);
      |               ^~~~
health_check_worker.c:706:47: error: invalid use of undefined type 'struct pollfd'
  706 |   struct pollfd *pollFileDescriptor = &pollFDs[healthCheckIndex];
      |                                               ^
cc1.exe: all warnings being treated as errors
make[1]: *** [<builtin>: health_check_worker.o] Error 1
make[1]: Leaving directory '/src/pg_auto_failover/src/monitor'
make: *** [Makefile:16: monitor] Error 2

I think we need to rewrite that particular piece of code to be crosscompatible

multiple failovers may leave replication slot open in the secondary

after performing manual failover multiple times observed that pg_replication_slots table has an entry with active = f

ideally we should not have any replication slots entry in the secondary node.

Unable to build from source

Hi. I'm trying to setup pg_auto_failover on a CentOS machine (with rh-postgresql10).
After successfully running yum install -y pg-auto-failover10_11,
when I tried to run pg_autoctl to create monitor, it says pgautofailover extension is also required.

After reading the README.md file, I came to know that this extension is installed when we build from source.

Now, When I ran make command inside pg_auto_failover/src, I got below error:

[root@dc2-rhel1 monitor]# make
gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -DLINUX_OOM_SCORE_ADJ=0 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -fPIC -std=c99 -Wall -Werror -Wno-unused-parameter -Iinclude -I/usr/include -I. -I. -I/usr/include/pgsql/server -I/usr/include/pgsql/internal -D_GNU_SOURCE -I/usr/include/libxml2   -c -o metadata.o metadata.c
In file included from metadata.c:17:0:
metadata.h:16:30: fatal error: storage/lockdefs.h: No such file or directory
 #include "storage/lockdefs.h"
                              ^
compilation terminated.
make: *** [metadata.o] Error 1

Since it was looking for storage/lockdefs.h file, I downloaded it from postgresql server code and pasted it inside /usr/include/pgsql/server/storage/ and ran make command again which resulted in following error:

[root@dc2-rhel1 monitor]# make
gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -DLINUX_OOM_SCORE_ADJ=0 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -fPIC -std=c99 -Wall -Werror -Wno-unused-parameter -Iinclude -I/usr/include -I. -I. -I/usr/include/pgsql/server -I/usr/include/pgsql/internal -D_GNU_SOURCE -I/usr/include/libxml2   -c -o metadata.o metadata.c
In file included from /usr/include/pgsql/server/access/genam.h:20:0,
                 from metadata.c:19:
/usr/include/pgsql/server/storage/lock.h:528:16: error: redefinition of ‘struct xl_standby_lock’
 typedef struct xl_standby_lock
                ^
In file included from metadata.h:16:0,
                 from metadata.c:17:
/usr/include/pgsql/server/storage/lockdefs.h:48:16: note: originally defined here
 typedef struct xl_standby_lock
                ^
In file included from /usr/include/pgsql/server/access/genam.h:20:0,
                 from metadata.c:19:
/usr/include/pgsql/server/storage/lock.h:533:3: error: conflicting types for ‘xl_standby_lock’
 } xl_standby_lock;
   ^
In file included from metadata.h:16:0,
                 from metadata.c:17:
/usr/include/pgsql/server/storage/lockdefs.h:53:3: note: previous declaration of ‘xl_standby_lock’ was here
 } xl_standby_lock;
   ^
metadata.c:22:33: fatal error: access/htup_details.h: No such file or directory
 #include "access/htup_details.h"
                                 ^
compilation terminated.
make: *** [metadata.o] Error 1

Now I'm stuck here.
Please let me know if I'm missing something.

Release 1.0.1

These instructions assume you have $VERSION, $PROJECT, and $REPO environment variables set in your shell (e.g. 1.0.0, pgautofailover, and pg_auto_failover). With those set, code from most steps can be copy-pasted.

Prepare Project

Ensure all needed changes are in the relevant release-x.y branch. git log --cherry-pick --no-merges release-x.y...master can be helpful. Be sure to cherry-pick changes in the same order they were merged to the main branch (but do not cherry-pick merge commits themselves)
Add a CHANGELOG entry in the ~~master~~ release-x.y branch summarizing meaningful changes ( will merge release branch into master after releasing to update the changelog there )
Update version number in defaults.h
Get complete approval for the commit you're tagging before creating or pushing any tags. Tags should be immutable, so do not proceed to the next step until you're sure you have everything you want in your release branch
Use git tag -a -s v$VERSION to create an annotated, signed tag for the release. Summarize the release in the one-line tag annotation (beneath 52 characters). Push the tag with git push origin v$VERSION
Visit the project's releases page (e.g. open https://github.com/citusdata/$REPO/releases)
- Create a new release object for your git tag (i.e. v$VERSION). Leave the description blank (it will auto-fill with the tag description)

As for signing tags, use your own key, and ensure it's known to GitHub.

Update OS Packages

release 1.0.4

These instructions assume you have $VERSION, $PROJECT, and $REPO environment variables set in your shell (e.g. 1.0.0, pgautofailover, and pg_auto_failover). With those set, code from most steps can be copy-pasted.

Prepare Project

Ensure all needed changes are in the relevant release-x.y branch. git log --cherry-pick --no-merges release-x.y...master can be helpful. Be sure to cherry-pick changes in the same order they were merged to the main branch (but do not cherry-pick merge commits themselves)
Add a CHANGELOG entry in the ~~master~~ release-x.y branch summarizing meaningful changes ( will merge release branch into master after releasing to update the changelog there )
Update version number in defaults.h
Get complete approval for the commit you're tagging before creating or pushing any tags. Tags should be immutable, so do not proceed to the next step until you're sure you have everything you want in your release branch
Use git tag -a -s v$VERSION to create an annotated, signed tag for the release. Summarize the release in the one-line tag annotation (beneath 52 characters). Push the tag with git push origin v$VERSION
Visit the project's releases page (e.g. open https://github.com/citusdata/$REPO/releases)
- Create a new release object for your git tag (i.e. v$VERSION). Leave the description blank (it will auto-fill with the tag description)

As for signing tags, use your own key, and ensure it's known to GitHub.

Update OS Packages

Release 1.0.3

These instructions assume you have $VERSION, $PROJECT, and $REPO environment variables set in your shell (e.g. 1.0.0, pgautofailover, and pg_auto_failover). With those set, code from most steps can be copy-pasted.

Prepare Project

Ensure all needed changes are in the relevant release-x.y branch. git log --cherry-pick --no-merges release-x.y...master can be helpful. Be sure to cherry-pick changes in the same order they were merged to the main branch (but do not cherry-pick merge commits themselves)
Add a CHANGELOG entry in the ~~master~~ release-x.y branch summarizing meaningful changes ( will merge release branch into master after releasing to update the changelog there )
Update version number in defaults.h
Get complete approval for the commit you're tagging before creating or pushing any tags. Tags should be immutable, so do not proceed to the next step until you're sure you have everything you want in your release branch
Use git tag -a -s v$VERSION to create an annotated, signed tag for the release. Summarize the release in the one-line tag annotation (beneath 52 characters). Push the tag with git push origin v$VERSION
Visit the project's releases page (e.g. open https://github.com/citusdata/$REPO/releases)
- Create a new release object for your git tag (i.e. v$VERSION). Leave the description blank (it will auto-fill with the tag description)

As for signing tags, use your own key, and ensure it's known to GitHub.

Update OS Packages

Doesn't compile OOTB on FreeBSD

Needs this quick and dirty patch to compile and link successfully:

diff --git a/src/bin/pg_autoctl/Makefile b/src/bin/pg_autoctl/Makefile
index 64b2ee7..d0d25a8 100644
--- a/src/bin/pg_autoctl/Makefile
+++ b/src/bin/pg_autoctl/Makefile
@@ -33,7 +33,7 @@ CFLAGS += $(COMMON_LIBS)
 
 LIBS  = -L $(shell $(PG_CONFIG) --libdir)
 LIBS += -L $(shell $(PG_CONFIG) --pkglibdir)
-LIBS += -lpq -lpgport -lpgcommon
+LIBS += -lpq -lpgport -lpgcommon -lintl
 
 all: $(PG_AUTOCTL) ;
 
diff --git a/src/bin/pg_autoctl/cli_create_drop_node.c b/src/bin/pg_autoctl/cli_create_drop_node.c
index 637dafd..e696650 100644
--- a/src/bin/pg_autoctl/cli_create_drop_node.c
+++ b/src/bin/pg_autoctl/cli_create_drop_node.c
@@ -13,6 +13,7 @@
 #include <getopt.h>
 #include <signal.h>
 #include <string.h>
+#include <arpa/inet.h>
 
 #include "postgres_fe.h"
 
diff --git a/src/bin/pg_autoctl/keeper.c b/src/bin/pg_autoctl/keeper.c
index 901be36..9cf2714 100644
--- a/src/bin/pg_autoctl/keeper.c
+++ b/src/bin/pg_autoctl/keeper.c
@@ -7,6 +7,7 @@
  *
  */
 #include <fcntl.h>
+#include <sys/stat.h>
 #include <inttypes.h>
 #include <time.h>
 #include <stdlib.h>
diff --git a/src/bin/pg_autoctl/state.c b/src/bin/pg_autoctl/state.c
index 435a673..5cc87ec 100644
--- a/src/bin/pg_autoctl/state.c
+++ b/src/bin/pg_autoctl/state.c
@@ -8,6 +8,7 @@
  */
 
 #include <fcntl.h>
+#include <sys/stat.h>
 #include <inttypes.h>
 #include <stdlib.h>
 #include <string.h>

how to join a new node to exists cluster when one of the nodes was failure?

I already build a cluster with 3 nodes , and I want to simulate that one of the nodes were permanently failure and I want to join a new node to the cluster to replace the failure one.

so I shutdown one of the nodes , create a new node , and run the command

sudo runuser -l postgres -c "pg_autoctl create postgres --pgdata /var/lib/postgresql/11/main --pgport 5432 --nodename pg-third --monitor postgres://autoctl_node@monitor:5432/pg_auto_failover"

here comes the error

14:38:03 ERROR Failed to execute "SELECT * FROM pgautofailover.register_node($1, $2, $3, $4, $5, $6::pgautofailover.replication_state, $7)": ERROR:  group 0 already has 2 members

14:38:03 ERROR Failed to register node pg-third:5432 in group 0 of formation "default" with initial state "init", see previous lines for details
14:38:03 ERROR Failed to register the existing local Postgres node "pg-third:5432" running at "/var/lib/postgresql/11/main"to the pg_auto_failover monitor at postgres://autoctl_node@monitor:5432/pg_auto_failover, see above for details

maybe the process should be remove a node using pg_autoctl drop node first , then join the new node , but that command can only drop the host self, not able to drop other node in cluster . BUT , the node need to be drop was already dead and unrecoverable , then how to drop the dead node and join a new node into the cluster ?

Upgrade primary and standby via pg_autoctl

It would be useful if pg_autoctl had a command to upgrade the primary and standby to a new version of Postgres via pg_upgrade without triggering a failover.

Unable to get --pgdata value in pg_autoctl get config command

I hit this while implementing pg_autoctl set node candidate-priority xxx command. When investigated found existing pg_autoctl config get ... command also exhibits the same behavior.

I have a monitor and a primary node set up in the same machine.

When I run the command pg_autoctl config get timeout.network_partition_timeout --pgdata /Users/mtuncer/dev/ha/test/primary/
I got

11:01:32 INFO  Managing PostgreSQL installation at "/Users/mtuncer/dev/ha/test/primary/"
pg_autoctl config get: Get the value of a given pg_autoctl configuration variable
usage: pg_autoctl config get [ section.option ]

  --pgdata      path to data directory

if I set PGDATA= /Users/mtuncer/dev/ha/test/primary/ and run the same command without --pgdata I got successful.

11:05:08 INFO  Managing PostgreSQL installation at "/Users/mtuncer/dev/ha/test/primary"
20

However, if I run the earlier command while PGDATA is set I still get the same error

pg_autoctl config get timeout.network_partition_timeout --pgdata /Users/mtuncer/dev/ha/test/primary/
11:06:13 INFO  Managing PostgreSQL installation at "/Users/mtuncer/dev/ha/test/primary"
pg_autoctl config get: Get the value of a given pg_autoctl configuration variable
usage: pg_autoctl config get [ section.option ]

  --pgdata      path to data directory

This is happening on Mac OS 10.14 with Apple Clang 11.0.0

hapostgres / pg_auto_failover Goto Github PK

pg_auto_failover's People

Contributors

Stargazers

Watchers

Forkers

pg_auto_failover's Issues

Prepare Project

Update OS Packages

Prepare Project

Update OS Packages

Prepare Project

Update OS Packages

Prepare Project

Update OS Packages

Prepare Project

Update OS Packages

Prepare Project

Update OS Packages

Recommend Projects

Recommend Topics

Recommend Org