hapostgres / pg_auto_failover Goto Github PK
View Code? Open in Web Editor NEWPostgres extension and service for automated failover and high-availability
License: Other
Postgres extension and service for automated failover and high-availability
License: Other
I follow the readme to build a monito-primary-secondary cluster
now replication is working , and when I shutdown primary, the secondary server will take over , and when the old primary comes back , it will join to the cluster as secondary , and the original secondary will promote to primary server , that's perfect !!
but I don't know to to connect to the database ??
I think it should not be connect via primary ip or secondary ip , if so , when the node goes down , the connection will fail.
in readme , it says connect URI is somewhing like
postgres://autoctl_node@monitor:5432/pg_auto_failover
yes , I can connect to the URI , but via that URI , I can not connect to the database which I just created in cluster.
so , what URI should I use in application or psql connect string ?
Hello,
On Ubuntu 18 and Postgresql 11, I'm systematically unable to automatically perform a failover and have the old primary become the secondary.
For example, here I stopped Postgresql and pg_auto_failover. The secondary was then promoted. When I turn back on the old primary, it isn't able to become a secondary.
Oct 04 17:14:53 pg-test-2 systemd[1]: Started pg_auto_failover.
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO Managing PostgreSQL installation at "/postgresql/11/main"
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO Found a stale pidfile at "/tmp/pg_autoctl/postgresql/11/main/pg_autoctl.pid"
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 WARN Removing the stale pid file "/tmp/pg_autoctl/postgresql/11/main/pg_autoctl.pid"
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO The version of extenstion "pgautofailover" is "1.0" on the monitor
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO pg_autoctl service is starting
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO Calling node_active for node default/13/0 with current state: primary, PostgreSQL is not running, sync_state is "", WAL delta is -1.
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO Postgres is not running, starting postgres
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO /usr/lib/postgresql/11/bin/pg_ctl --pgdata /postgresql/11/main --options "-p 5432" --options "-h *" --wait start
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 WARN PostgreSQL was not running, restarted with pid 2122
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO FSM transition from "primary" to "demoted": A failover occurred, no longer primary
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO Transition complete: current state is now "demoted"
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO Calling node_active for node default/13/0 with current state: demoted, PostgreSQL is not running, sync_state is "", WAL delta is -1.
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO FSM transition from "demoted" to "catchingup": A new primary is available. First, try to rewind. If that fails, do a pg_basebackup.
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO The primary node returned by the monitor is pg-test-1.c.hunter-io.internal:5432
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO Rewinding PostgreSQL to follow new primary pg-test-1.c.hunter-io.internal:5432
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 ERROR Connection to database failed: could not connect to server: No such file or directory
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: Is the server running locally and accepting
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 ERROR Failed to get the postgresql.conf path from the local postgres server, see above for details
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 WARN Failed to rewind demoted primary to standby, trying pg_basebackup instead
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO Initialising PostgreSQL as a hot standby
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO Target directory exists: "/postgresql/11/main", stopping PostgreSQL
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO pg_ctl: no server running
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO pg_ctl stop failed, but PostgreSQL is not running anyway
Oct 04 17:14:53 pg-test-2 pg_autoctl[2105]: 17:14:53 INFO Running /usr/lib/postgresql/11/bin/pg_basebackup -w -h pg-test-1.c.hunter-io.internal -p 5432 --pgdata /postgresql/11/backup -U pgautofailover_replicator --write-recovery-conf --max-rate 100M --wal-method=stream --slot pgautofailover_standby ...
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 17:14:56 INFO pg_basebackup: initiating base backup, waiting for checkpoint to complete
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: pg_basebackup: checkpoint completed
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: pg_basebackup: write-ahead log start point: 0/53000028 on timeline 3
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: pg_basebackup: starting background WAL receiver
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 0/23816 kB (0%), 0/1 tablespace (/postgresql/11/backup/backup_label )
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 10345/23816 kB (43%), 0/1 tablespace (...tgresql/11/backup/base/13090/827)
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 23826/23826 kB (100%), 0/1 tablespace (...esql/11/backup/global/pg_control)
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 23826/23826 kB (100%), 1/1 tablespace
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: pg_basebackup: write-ahead log end point: 0/530000F8
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: pg_basebackup: waiting for background process to finish streaming ...
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: pg_basebackup: base backup completed
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 17:14:56 INFO Postgres is not running, starting postgres
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 17:14:56 INFO /usr/lib/postgresql/11/bin/pg_ctl --pgdata /postgresql/11/main --options "-p 5432" --options "-h *" --wait start
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 17:14:56 ERROR Failed to start PostgreSQL. pg_ctl start returned: 1
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 17:14:56 ERROR Failed to become standby server, see above for details
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 17:14:56 ERROR Failed to transition from state "demoted" to state "catchingup", see above.
Oct 04 17:14:56 pg-test-2 pg_autoctl[2105]: 17:14:56 ERROR Failed to transition to state "catchingup", retrying...
This problem can manually be fixed by removing the node an adding it back, but of course, that's not ideal :)
Any help on how to fix this is welcome! Thanks.
We have many uses of char buffer[BUFFER_LENGTH]
and strcpy
, sprintf
to that buffer.
Although we haven't seen a problem yet. Using fixed size buffer to host dynamic length content and having unchecked access to these buffers is not safe.
We should use PQExpBuffer
and appendPQExpBufferStr()
functions to modify that buffer.
I'm wondering whether pg_auto_failover
could be deployed across two datacenters or availability zones, provided they are close enough to still allow for sync rep.
AFAICT, pg_auto_failover
allows only one standby per group, correct? Otherwise, one could maybe deploy the monitor and the primary in different DCs and have one standby in each. This way, the standby local to the monitor could be promoted if the other DC is down, or replication would just go on with the monitor DC down.
If one would deploy the standby in the same DC as the monitor, and that one would go down, the primary would be all alone. Right now, it seems to stay in sync rep (i.e. is blocked) until the other DC is up again, even though it knows that both its standby and the monitor are gone:
17:40:26 ERROR Failed to get node state for node 8 (10.0.3.140:55432) in group 0 of formation "default" with initial state "primary", replication state "", and replication lag -1, see previous lines for details
17:40:26 ERROR Failed to get the goal state from the monitor
17:40:26 WARN Checking for network partitions...
17:40:31 ERROR PostgreSQL primary server has lost track of its standby: pg_stat_replication reports no client using the slot "pgautofailover_standby".
Would it be possible for the primary to march on and clear synchronous_standby_names
, or would this be an architecture violation (not going through the FSM or so)? Any other suggestions on how to setup pg_auto_failover
across two DCs?
These instructions assume you have $VERSION
, $PROJECT
, and $REPO
environment variables set in your shell (e.g. 1.0.0
, pgautofailover
, and pg_auto_failover
). With those set, code from most steps can be copy-pasted.
release-x.y
branch. git log --cherry-pick --no-merges release-x.y...master
can be helpful. Be sure to cherry-pick changes in the same order they were merged to the main branch (but do not cherry-pick merge commits themselves)CHANGELOG
entry in the master
release-x.y
branch summarizing meaningful changes ( will merge release branch into master after releasing to update the changelog there )defaults.h
git tag -a -s v$VERSION
to create an annotated, signed tag for the release. Summarize the release in the one-line tag annotation (beneath 52 characters). Push the tag with git push origin v$VERSION
open https://github.com/citusdata/$REPO/releases
)
v$VERSION
). Leave the description blank (it will auto-fill with the tag description)As for signing tags, use your own key, and ensure it's known to GitHub.
debian-$PROJECT
branch of the packaging repository; create a new branch for your changes
$VERSION
, stable
) to the debian/changelog
filepkglatest
variable in the pkgvars
file to $VERSION
citus_package -p=debian/jessie -p=debian/stretch -p=ubuntu/bionic -p=ubuntu/xenial -p=ubuntu/trusty local release-$VERSION 2>&1 | tee citus_package.log
citus_package -p=debian/jessie -p=debian/stretch -p=ubuntu/bionic -p=ubuntu/xenial -p=ubuntu/trusty local release 2>&1 | tee citus_package.log
grep -Ei '(warning|\bi|\be|\bw):' citus_package.log | sort | uniq -c
. Ignore any warnings about using a gain-root-command while being root or Recognised distributionsredhat-$PROJECT
branch of the packaging repository; create a new branch for your changes
pkglatest
variable in the pkgvars
file to $VERSION
$PROJECT.spec
file, being sure to:
Version:
fieldSource0:
field$VERSION
) to the %changelog
sectioncitus_package -p=el/7 -p=el/6 -p=ol/7 -p=ol/6 local release-$VERSION 2>&1 | tee citus_package.log
citus_package -p=el/7 -p=el/6 -p=ol/7 -p=ol/6 local release 2>&1 | tee citus_package.log
grep -Ei '(warning|\bi|\be|\bw):' citus_package.log | sort | uniq -c | grep -v "\.c"
. Ignore any errors about --disable-dependency-tracking
It would be useful if pg_autoctl
had a command to change a setting like max_connections
or shared_buffers
when streaming replication is set up. We basically need to make sure that both servers have the setting in their postgresql.conf (or include) and then do a restart, preferably without triggering a failover.
In my specific case, the container has two interfaces. The IP from interface 1 was added to the pg_hba.conf - however connections seem to it from another container on the same network come from the secondary interface.
Oddly enough, the original "registration attempt" also is performed over the same network interface, so I'm confused as to why it decided to take the IP from the primary interface.
As an aside, there are many issues/inconveniences in the area of DNS resolution / pg_hba entries. For instance, given a setup of 3 containers: monitor, db1, db2: db1 and db2 register itself with monitor, db2 fails ot perform a pg_basebackup from db1 since db1 has 'db2' added in it's pg_hba.conf -- however somehow pg_basebackup tells me it can't resolve 'db1' to an IP whereas ping, dig, host have no problems in doing so.
Hi,
I would like to know if this extension is compatible with citus data extension. As far as I know, the citus extension only provide distribution of the data, and query, but not auto failover.
If it is compatible, is there a writeup somewhere I can learn how to do it?
Thank you
I was unable to create monitor using 1.0.4 release with the following command
pg_autoctl create monitor --pgdata ${WORK_DIR}/monitor --pgport 6000 --nodename localhost
It failed with
LOG: syntax error in file "/Users/mtuncer/dev/ha/test/monitor/postgresql-auto-failover.conf" line 4, near end of line
FATAL: configuration file "/Users/mtuncer/dev/ha/test/monitor/postgresql.conf" contains errors
when I checked postgresql-auto-failover.conf
failure was port setting file with no value set.
port =
I checked monitor/pg_autoctl.cfg
that is stored in temporary cache. Port setting was correctly applied. We just failed to move that setting into postgresql-auto-failover.conf
I tested this with ubuntu, it was fine.
I also testhed v1.0.3 with mac which is also fine. It must be something we introduced along 1.0.4
slave/standby node postgres file lost, pg_autoctl can't startup postgres, pg_autoctl can't auto run pg_basebackup to rebuild it.
psql postgres://autoctl_node@monitor/pg_auto_failover
psql (11.3 (Ubuntu 11.3-1.pgdg18.04+1))
Type "help" for help.
pg_auto_failover=> select pgautofailover.perform_failover(formation_id => 'default', group_id => 0);
ERROR: permission denied for table node
CONTEXT: SQL statement "UPDATE pgautofailover.node SET goalstate = $1, statechangetime = now() WHERE nodename = $2 AND nodeport = $3"
pg_auto_failover=>
These instructions assume you have $VERSION
, $PROJECT
, and $REPO
environment variables set in your shell (e.g. 1.0.0
, pgautofailover
, and pg_auto_failover
). With those set, code from most steps can be copy-pasted.
release-x.y
branch. git log --cherry-pick --no-merges release-x.y...master
can be helpful. Be sure to cherry-pick changes in the same order they were merged to the main branch (but do not cherry-pick merge commits themselves)CHANGELOG
entry in the master
release-x.y
branch summarizing meaningful changes ( will merge release branch into master after releasing to update the changelog there )git tag -a -s v$VERSION
to create an annotated, signed tag for the release. Summarize the release in the one-line tag annotation (beneath 52 characters). Push the tag with git push origin v$VERSION
open https://github.com/citusdata/$REPO/releases
)
v$VERSION
). Leave the description blank (it will auto-fill with the tag description)As for signing tags, use your own key, and ensure it's known to GitHub.
debian-$PROJECT
branch of the packaging repository; create a new branch for your changes
$VERSION
, stable
) to the debian/changelog
filepkglatest
variable in the pkgvars
file to $VERSION
citus_package -p=debian/jessie -p=debian/stretch -p=ubuntu/bionic -p=ubuntu/xenial -p=ubuntu/trusty local release 2>&1 | tee citus_package.log
citus_package -p=debian/jessie -p=debian/stretch -p=ubuntu/bionic -p=ubuntu/xenial -p=ubuntu/trusty local nightly 2>&1 | tee citus_package.log
grep -Ei '(warning|\bi|\be|\bw):' citus_package.log | sort | uniq -c
. Ignore any warnings about using a gain-root-command while being root or Recognised distributionsredhat-$PROJECT
branch of the packaging repository; create a new branch for your changes
pkglatest
variable in the pkgvars
file to $VERSION
$PROJECT.spec
file, being sure to:
Version:
fieldSource0:
field$VERSION
) to the %changelog
sectioncitus_package -p=el/7 -p=el/6 -p=ol/7 -p=ol/6 local release 2>&1 | tee citus_package.log
citus_package -p=el/7 -p=el/6 -p=ol/7 -p=ol/6 local nightly 2>&1 | tee citus_package.log
grep -Ei '(warning|\bi|\be|\bw):' citus_package.log | sort | uniq -c | grep -v "\.c"
. Ignore any errors about --disable-dependency-tracking
These instructions assume you have $VERSION
, $PROJECT
, and $REPO
environment variables set in your shell (e.g. 1.0.5
, pgautofailover
, and pg_auto_failover
). With those set, code from most steps can be copy-pasted.
release-x.y
branch. git log --cherry-pick --no-merges release-x.y...master
can be helpful. Be sure to cherry-pick changes in the same order they were merged to the main branch (but do not cherry-pick merge commits themselves)CHANGELOG
entry in the master
release-x.y
branch summarizing meaningful changes ( will merge release branch into master after releasing to update the changelog there )defaults.h
git tag -a -s v$VERSION
to create an annotated, signed tag for the release. Summarize the release in the one-line tag annotation (beneath 52 characters). Push the tag with git push origin v$VERSION
open https://github.com/citusdata/$REPO/releases
)
v$VERSION
). Leave the description blank (it will auto-fill with the tag description)As for signing tags, use your own key, and ensure it's known to GitHub.
debian-$PROJECT
branch of the packaging repository; create a new branch for your changes
$VERSION
, stable
) to the debian/changelog
filepkglatest
variable in the pkgvars
file to $VERSION
citus_package -p=debian/jessie -p=debian/stretch -p=ubuntu/bionic -p=ubuntu/xenial -p=ubuntu/trusty local release-$VERSION 2>&1 | tee citus_package.log
citus_package -p=debian/jessie -p=debian/stretch -p=ubuntu/bionic -p=ubuntu/xenial -p=ubuntu/trusty local release 2>&1 | tee citus_package.log
grep -Ei '(warning|\bi|\be|\bw):' citus_package.log | sort | uniq -c
. Ignore any warnings about using a gain-root-command while being root or Recognised distributionsredhat-$PROJECT
branch of the packaging repository; create a new branch for your changes
pkglatest
variable in the pkgvars
file to $VERSION
$PROJECT.spec
file, being sure to:
Version:
fieldSource0:
field$VERSION
) to the %changelog
sectioncitus_package -p=el/7 -p=el/6 -p=ol/7 -p=ol/6 local release-$VERSION 2>&1 | tee citus_package.log
citus_package -p=el/7 -p=el/6 -p=ol/7 -p=ol/6 local release 2>&1 | tee citus_package.log
grep -Ei '(warning|\bi|\be|\bw):' citus_package.log | sort | uniq -c | grep -v "\.c"
. Ignore any errors about --disable-dependency-tracking
I hit this while implementing pg_autoctl set node candidate-priority xxx
command. When investigated found existing pg_autoctl config get ...
command also exhibits the same behavior.
I have a monitor and a primary node set up in the same machine.
When I run the command pg_autoctl config get timeout.network_partition_timeout --pgdata /Users/mtuncer/dev/ha/test/primary/
I got
11:01:32 INFO Managing PostgreSQL installation at "/Users/mtuncer/dev/ha/test/primary/"
pg_autoctl config get: Get the value of a given pg_autoctl configuration variable
usage: pg_autoctl config get [ section.option ]
--pgdata path to data directory
if I set PGDATA= /Users/mtuncer/dev/ha/test/primary/
and run the same command without --pgdata I got successful.
11:05:08 INFO Managing PostgreSQL installation at "/Users/mtuncer/dev/ha/test/primary"
20
However, if I run the earlier command while PGDATA is set I still get the same error
pg_autoctl config get timeout.network_partition_timeout --pgdata /Users/mtuncer/dev/ha/test/primary/
11:06:13 INFO Managing PostgreSQL installation at "/Users/mtuncer/dev/ha/test/primary"
pg_autoctl config get: Get the value of a given pg_autoctl configuration variable
usage: pg_autoctl config get [ section.option ]
--pgdata path to data directory
This is happening on Mac OS 10.14 with Apple Clang 11.0.0
Problem occurs when you create a worker or coordinator with a nodename of an IP address like this:
pg_autoctl create worker --nodename 10.0.0.20
pg_hba.conf entries don't have the /32 suffix added to indicate that they're single-IP CIDR blocks, and the pg_hba.conf is not valid.
OS: ubuntu 18.04
pg_autofailover: 1.0.4
I follow the document to create a systemd file using command
pg_autoctl -q show systemd --pgdata "/var/lib/postgresql/11/main" | sudo tee /etc/systemd/system/pgautofailover.service
and the pgautofailover.service
2019-09-20 16:52:55 [administrator@pg-slave ~]$ sudo cat /etc/systemd/system/pgautofailover.service
[Unit]
Description = pg_auto_failover
[Service]
WorkingDirectory = /var/lib/postgresql/11/main
Environment = 'PGDATA=/var/lib/postgresql/11/main'
User = postgres
ExecStart = /usr/bin/pg_autoctl run
Restart = always
StartLimitBurst = 0
[Install]
WantedBy = multi-user.target
I can control pg_autofailover manually with systemd , like
sudo systemctl stop pgautofailover
sudo systemctl start pgautofailover
but when the node goes to reboot , pgautofailover not automatically start
I have to login to node and do sudo systemctl start pgautofailover
any suggestions ?
It'd be nice to have recommendations on how to do backups:
These instructions assume you have $VERSION
, $PROJECT
, and $REPO
environment variables set in your shell (e.g. 1.0.0
, pgautofailover
, and pg_auto_failover
). With those set, code from most steps can be copy-pasted.
release-x.y
branch. git log --cherry-pick --no-merges release-x.y...master
can be helpful. Be sure to cherry-pick changes in the same order they were merged to the main branch (but do not cherry-pick merge commits themselves)CHANGELOG
entry in the master
release-x.y
branch summarizing meaningful changes ( will merge release branch into master after releasing to update the changelog there )defaults.h
git tag -a -s v$VERSION
to create an annotated, signed tag for the release. Summarize the release in the one-line tag annotation (beneath 52 characters). Push the tag with git push origin v$VERSION
open https://github.com/citusdata/$REPO/releases
)
v$VERSION
). Leave the description blank (it will auto-fill with the tag description)As for signing tags, use your own key, and ensure it's known to GitHub.
debian-$PROJECT
branch of the packaging repository; create a new branch for your changes
$VERSION
, stable
) to the debian/changelog
filepkglatest
variable in the pkgvars
file to $VERSION
citus_package -p=debian/jessie -p=debian/stretch -p=ubuntu/bionic -p=ubuntu/xenial -p=ubuntu/trusty local release-$VERSION 2>&1 | tee citus_package.log
citus_package -p=debian/jessie -p=debian/stretch -p=ubuntu/bionic -p=ubuntu/xenial -p=ubuntu/trusty local release 2>&1 | tee citus_package.log
grep -Ei '(warning|\bi|\be|\bw):' citus_package.log | sort | uniq -c
. Ignore any warnings about using a gain-root-command while being root or Recognised distributionsredhat-$PROJECT
branch of the packaging repository; create a new branch for your changes
pkglatest
variable in the pkgvars
file to $VERSION
$PROJECT.spec
file, being sure to:
Version:
fieldSource0:
field$VERSION
) to the %changelog
sectioncitus_package -p=el/7 -p=el/6 -p=ol/7 -p=ol/6 local release-$VERSION 2>&1 | tee citus_package.log
citus_package -p=el/7 -p=el/6 -p=ol/7 -p=ol/6 local release 2>&1 | tee citus_package.log
grep -Ei '(warning|\bi|\be|\bw):' citus_package.log | sort | uniq -c | grep -v "\.c"
. Ignore any errors about --disable-dependency-tracking
At the moment our command pg_autoctl create postgres
knows how to register a pre-existing PostgreSQL instance when it's a primary server. This allows users to provision as usual and only then register to the monitor.
There are two main cases when registering an existing standby:
In the first case we should check that the system_identifier
is the same as in the primary server and then continue with the standby initialisation as usual, only skipping the pg_basebackup
step.
In the second case, we should still check that the system_identifier
is the same as in the primary server, and then verify that the setup is done according to pg_auto_failover expectations:
This is a brain dump, the actual implementation might differ to a large extend with the thinking here, though I think this is the general idea we should pursue here.
Hi,
is this a known issue?
postgres@pg-af3:/home/postgres/ [af] pg_autoctl show state
Name | Port | Group | Node | Current State | Assigned State
---------------------------+--------+-------+-------+-------------------+------------------
pg-af1.it.dbi-services.com | 5432 | 0 | 1 | primary | primary
pg-af2.it.dbi-services.com | 5432 | 0 | 2 | secondary | secondary
17:29:11 postgres@pg-af3:/home/postgres/ [af] psql postgres://autoctl@pg-af3:5432/pg_auto_failover
psql (12.0 dbi services build)
Type "help" for help.
pg_auto_failover=> select pgautofailover.perform_failover(formation_id => 'default', group_id => 0);
ERROR: permission denied for schema pgautofailover
LINE 1: select pgautofailover.perform_failover(formation_id => 'defa...
^
PostgreSQL 12 on CentOS 8.
pg_autoctl version 1.0.5
Not an issue to fix it manually but I still wonder why permissions are missing in the monitor database. Did I miss something?
Best regards
Daniel
Hello!
First of all thanks for the effort you guys put into this, also kudos for using pg_rewind, awesome stuff !
While testing, i came across the following scenario : say that you lose the monitor some reason (in my test i just deleted the container). Replication keeps working, no issues there, but when i bring a new monitor up, i can add the master by removing the state file and issuing a "pg_autoctl create postgres", but when i try to add the slave, i get :
" 08:52:55 ERROR pg_autoctl doesn't know how to register an already existing standby server at the moment ". Of course, i can add it by rebuilding the slave but i was wondering if theres an alternative way doing this that won't require slave rebuild.
It would be useful if pg_autoctl
had a command to upgrade the primary and standby to a new version of Postgres via pg_upgrade
without triggering a failover.
I tried setting up pg_auto_failover on a CentOS 7 Azure VM (locally), but when setting up the standby I ran into the following issue:
$ pg_autoctl create postgres --monitor postgres://autoctl_node@pgaf-test.dmqu0iik0f1uppllrfdl2ayvwe.bx.internal.cloudapp.net:6000/pg_auto_failover --pgport 6002 --nodename `hostname --fqdn` --pgdata node2
...
03:39:15 ERROR pg_basebackup: could not connect to server: FATAL: no pg_hba.conf entry for replication connection from host "10.1.19.4", user "pgautofailover_replicator", SSL off
03:39:15 ERROR Failed to run pg_basebackup: exit code 1
03:39:15 ERROR Failed initialise standby server, see above for details
03:39:15 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.
In the pg_hba.conf, the following entries were added:
host all "pgautofailover_monitor" pgaf-test.dmqu0iik0f1uppllrfdl2ayvwe.bx.internal.cloudapp.net trust # Auto-generated by pg_auto_failover
host replication "pgautofailover_replicator" pgaf-test.dmqu0iik0f1uppllrfdl2ayvwe.bx.internal.cloudapp.net trust # Auto-generated by pg_auto_failover
host "postgres" "pgautofailover_replicator" pgaf-test.dmqu0iik0f1uppllrfdl2ayvwe.bx.internal.cloudapp.net trust # Auto-generated by pg_auto_failover
Using hostnames in pg_hba.conf relies on reverse DNS of the client IP. However, reverse DNS of internal IPs does not work on Azure.
Adding pg_hba.conf records for 10.1.19.4/32
on the primary and running SELECT pg_reload_conf()
before creating the standby resolved the problem.
Hello everyone!
Thanks for the interesting extension and adjacent tool (pg_autoctl
utility).
I would like to know how cluster deals with different faults. Especially:
What happens if master node appears in isolation from monitor and other node? Will there be multimaster layout and as consequence split brain?
Is the monitor a single point of failure? What happens if monitor node falls?
Ubuntu 16.04, PostgreSQL 11.5 installed from PGDG
It appears that pg_autoctl
doesn't know about the separation of configuration in the PGDG packages where config files are in /etc/postgresql/[VERSION]/[CLUSTER]
instead of in PGDATA
location and the restart doesn't use the right values.
Output:
postgres@appdb-server2:~$ pg_autoctl create postgres --pgport 5432 --monitor postgres://autoctl_node@appdb-monitor1/pg_auto_failover --nodename appdb-server2.testing.aweberint.com --dbname app --formation appdb
15:53:15 INFO Continuing from a previous `pg_autoctl create` failed attempt
15:53:15 INFO PostgreSQL state at registration time was: PostgreSQL is running and a primary server
15:53:15 INFO FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
15:53:15 INFO The primary node returned by the monitor is appdb-server1.testing.aweberint.com:5432
15:53:15 INFO Initialising PostgreSQL as a hot standby
15:53:15 INFO Target directory exists: "/var/lib/postgresql/11/appdb", stopping PostgreSQL
15:53:20 INFO Running /usr/bin/pg_basebackup -w -h appdb-server1.testing.aweberint.com -p 5432 --pgdata /var/lib/postgresql/11/backup -U pgautofailover_replicator --write-recovery-conf --max-rate 100M --wal-method=stream --slot pgautofailover_standby ...
15:53:21 INFO pg_basebackup: initiating base backup, waiting for checkpoint to complete
WARNING: skipping special file "./postgresql.conf"
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/F000028 on timeline 1
WARNING: skipping special file "./pg_hba.conf"
pg_basebackup: starting background WAL receiver
0/39116 kB (0%), 0/1 tablespace (...ostgresql/11/backup/backup_label)
WARNING: skipping special file "./postgresql.conf"
WARNING: skipping special file "./pg_hba.conf"
31220/39116 kB (79%), 0/1 tablespace (...gresql/11/backup/base/13086/2662)
39126/39126 kB (100%), 0/1 tablespace (...esql/11/backup/global/pg_control)
39126/39126 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: 0/F0000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed
15:53:21 INFO Postgres is not running, starting postgres
15:53:21 INFO /usr/bin/pg_ctl --pgdata /var/lib/postgresql/11/appdb --options "-p 5432" --options "-h *" --wait start
15:53:21 ERROR Failed to start PostgreSQL. pg_ctl start returned: 1
15:53:21 ERROR waiting for server to start....postgres: could not access the server configuration file "/var/lib/postgresql/11/appdb/postgresql.conf": No such file or directory
stopped waiting
15:53:21 ERROR Failed initialise standby server, see above for details
15:53:21 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.
If I manually start Postgres after this, replication works, but I can't figure out how to get pg_autoctl to know the machine is a standby and replicating as it hangs:
postgres@appdb-server2:~$ pg_autoctl run
16:04:41 INFO Managing PostgreSQL installation at "/var/lib/postgresql/11/appdb"
16:04:41 WARN The `pg_autoctl create` did not complete, completing now.
16:04:41 INFO Continuing from a previous `pg_autoctl create` failed attempt
16:04:41 INFO PostgreSQL state at registration time was: PostgreSQL is running and a primary server
16:04:41 INFO FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
16:04:41 ERROR Failed to execute "SELECT * FROM pgautofailover.get_primary($1, $2)": ERROR: schema "pgautofailover" does not exist
LINE 1: SELECT * FROM pgautofailover.get_primary($1, $2)
^
16:04:41 ERROR Failed to get the primary node in the HA group from the monitor while running "SELECT * FROM pgautofailover.get_primary($1, $2)" with formation "" and group ID 0
16:04:41 ERROR Failed to initialise standby because get the primary node from the monitor failed, see above for details
16:04:41 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.
16:04:41 FATAL Failed to initialize pg_auto_failover service, see above for details
I could probably recover from this if I knew what manual steps to take.
These instructions assume you have $VERSION
, $PROJECT
, and $REPO
environment variables set in your shell (e.g. 1.0.0
, pgautofailover
, and pg_auto_failover
). With those set, code from most steps can be copy-pasted.
release-x.y
branch. git log --cherry-pick --no-merges release-x.y...master
can be helpful. Be sure to cherry-pick changes in the same order they were merged to the main branch (but do not cherry-pick merge commits themselves)CHANGELOG
entry in the master
release-x.y
branch summarizing meaningful changes ( will merge release branch into master after releasing to update the changelog there )defaults.h
git tag -a -s v$VERSION
to create an annotated, signed tag for the release. Summarize the release in the one-line tag annotation (beneath 52 characters). Push the tag with git push origin v$VERSION
open https://github.com/citusdata/$REPO/releases
)
v$VERSION
). Leave the description blank (it will auto-fill with the tag description)As for signing tags, use your own key, and ensure it's known to GitHub.
debian-$PROJECT
branch of the packaging repository; create a new branch for your changes
$VERSION
, stable
) to the debian/changelog
filepkglatest
variable in the pkgvars
file to $VERSION
citus_package -p=debian/jessie -p=debian/stretch -p=ubuntu/bionic -p=ubuntu/xenial -p=ubuntu/trusty local release-$VERSION 2>&1 | tee citus_package.log
citus_package -p=debian/jessie -p=debian/stretch -p=ubuntu/bionic -p=ubuntu/xenial -p=ubuntu/trusty local release 2>&1 | tee citus_package.log
grep -Ei '(warning|\bi|\be|\bw):' citus_package.log | sort | uniq -c
. Ignore any warnings about using a gain-root-command while being root or Recognised distributionsredhat-$PROJECT
branch of the packaging repository; create a new branch for your changes
pkglatest
variable in the pkgvars
file to $VERSION
$PROJECT.spec
file, being sure to:
Version:
fieldSource0:
field$VERSION
) to the %changelog
sectioncitus_package -p=el/7 -p=el/6 -p=ol/7 -p=ol/6 local release-$VERSION 2>&1 | tee citus_package.log
citus_package -p=el/7 -p=el/6 -p=ol/7 -p=ol/6 local release 2>&1 | tee citus_package.log
grep -Ei '(warning|\bi|\be|\bw):' citus_package.log | sort | uniq -c | grep -v "\.c"
. Ignore any errors about --disable-dependency-tracking
If the postgres of the monitor gets killed or crashes it is not correctly restarted by the 'keeper'.
This can be reproduced in 2 sessions:
Session 1
pg_autoctl run --pgdata /path/of/monitor
Session 2
kill <pid of monitor postgres>
(or a killall postgres for that matter)
pg_autoctl
will be in an error loop but not restarting the postgres process. This is different behaviour than we have when running a postgres node.
Is it possible to manually choose the node that will become primary or wait_primary?
thanks
I'm using Ansible to manage my PostgreSQL nodes and have to account for the line added by pg_auto_failover to pg_hba.conf in my Ansible config.
The line is also much more permissive than I'd prefer to allow.
Having an option that is along the lines of "--skip-pg_hba" that lets me manage that when using "create monitor" and "create postgres" would be ideal.
I already create a 3 nodes cluster and get pg_auto_failover.
but when I try to connect via URI , there's messages complain about pg_hba.conf
postgres@pg-slave:~$ psql -d postgres://pg-slave:5432,pg-primary:5432/postgres?target_session_attrs=read-write
psql: FATAL: no pg_hba.conf entry for host "192.168.11.153", user "postgres", database "postgres", SSL off
postgres@pg-slave:~$
so I modify /var/lib/postgresql/11/main/pg_hba.conf (my pgdata path) using the command below
echo "host all postgres 192.168.11.0/24 trust" >> /var/lib/postgresql/11/main/pg_hba.conf
tail -5 /var/lib/postgresql/11/main/pg_hba.conf
and the pg_hba.conf looks like
host "postgres" "postgres" pg-primary trust # Auto-generated by pg_auto_failover
host all "pgautofailover_monitor" monitor trust # Auto-generated by pg_auto_failover
host replication "pgautofailover_replicator" pg-slave trust # Auto-generated by pg_auto_failover
host "postgres" "pgautofailover_replicator" pg-slave trust # Auto-generated by pg_auto_failover
host all postgres 192.168.11.0/24 trust
ok , now , in normal situation , I have to restart postgresql service to make new pg_hba.conf works
but I have no idea how to do so with pg_auto_failover .
there is a systemd config with pg_auto_failover , but looks like not iomplement into OS ? (or it did , but I find nothing related in /etc/systemd/system/multi-user.target.wants)
so , here is my question , how do I reload/restart postgresql to make new pg_hba.conf works ?
I already try pg_autoctl reload
2019-09-19 07:51:58 [administrator@pg-slave ~]$ sudo runuser -l postgres -c "pg_autoctl reload --pgdata /var/lib/postgresql/11/main"
07:52:15 INFO Managing PostgreSQL installation at "/var/lib/postgresql/11/main"
2019-09-19 07:52:15 [administrator@pg-slave ~]$ sudo runuser -l postgres -c "psql -d postgres://pg-slave:5432,pg-primary:5432/postgres?target_session_attrs=read-write"
psql: FATAL: no pg_hba.conf entry for host "192.168.11.153", user "postgres", database "postgres", SSL off
2019-09-19 07:52:24 [administrator@pg-slave ~]$
stop the keeper and run again
2019-09-19 07:53:22 [administrator@pg-slave ~]$ sudo runuser -l postgres -c "pg_autoctl run --pgdata /var/lib/postgresql/11/main &"
2019-09-19 07:53:32 [administrator@pg-slave ~]$ 07:53:32 INFO Managing PostgreSQL installation at "/var/lib/postgresql/11/main"
07:53:32 INFO Found a stale pidfile at "/tmp/pg_autoctl/var/lib/postgresql/11/main/pg_autoctl.pid"
07:53:32 WARN Removing the stale pid file "/tmp/pg_autoctl/var/lib/postgresql/11/main/pg_autoctl.pid"
07:53:32 INFO The version of extenstion "pgautofailover" is "1.0" on the monitor
07:53:32 INFO pg_autoctl service is starting
07:53:32 INFO Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
2019-09-19 07:53:34 [administrator@pg-slave ~]$ sudo runuser -l postgres -c "psql -d postgres://pg-slave:5432,pg-primary:5432/postgres?target_session_attrs=read-write"07:53:37 INFO Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
psql: FATAL: no pg_hba.conf entry for host "192.168.11.153", user "postgres", database "postgres", SSL off
.
please , any suggestions ?
It would be very convenient for me to be able to put a node into maintenance mode from that node's machine. Main use case is preparation for a minor PG upgrade.
Hi. I'm trying to setup pg_auto_failover on a CentOS machine (with rh-postgresql10).
After successfully running yum install -y pg-auto-failover10_11,
when I tried to run pg_autoctl to create monitor, it says pgautofailover extension is also required.
After reading the README.md file, I came to know that this extension is installed when we build from source.
Now, When I ran make command inside pg_auto_failover/src, I got below error:
[root@dc2-rhel1 monitor]# make
gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -DLINUX_OOM_SCORE_ADJ=0 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -fPIC -std=c99 -Wall -Werror -Wno-unused-parameter -Iinclude -I/usr/include -I. -I. -I/usr/include/pgsql/server -I/usr/include/pgsql/internal -D_GNU_SOURCE -I/usr/include/libxml2 -c -o metadata.o metadata.c
In file included from metadata.c:17:0:
metadata.h:16:30: fatal error: storage/lockdefs.h: No such file or directory
#include "storage/lockdefs.h"
^
compilation terminated.
make: *** [metadata.o] Error 1
Since it was looking for storage/lockdefs.h file, I downloaded it from postgresql server code and pasted it inside /usr/include/pgsql/server/storage/ and ran make command again which resulted in following error:
[root@dc2-rhel1 monitor]# make
gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -DLINUX_OOM_SCORE_ADJ=0 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -fPIC -std=c99 -Wall -Werror -Wno-unused-parameter -Iinclude -I/usr/include -I. -I. -I/usr/include/pgsql/server -I/usr/include/pgsql/internal -D_GNU_SOURCE -I/usr/include/libxml2 -c -o metadata.o metadata.c
In file included from /usr/include/pgsql/server/access/genam.h:20:0,
from metadata.c:19:
/usr/include/pgsql/server/storage/lock.h:528:16: error: redefinition of ‘struct xl_standby_lock’
typedef struct xl_standby_lock
^
In file included from metadata.h:16:0,
from metadata.c:17:
/usr/include/pgsql/server/storage/lockdefs.h:48:16: note: originally defined here
typedef struct xl_standby_lock
^
In file included from /usr/include/pgsql/server/access/genam.h:20:0,
from metadata.c:19:
/usr/include/pgsql/server/storage/lock.h:533:3: error: conflicting types for ‘xl_standby_lock’
} xl_standby_lock;
^
In file included from metadata.h:16:0,
from metadata.c:17:
/usr/include/pgsql/server/storage/lockdefs.h:53:3: note: previous declaration of ‘xl_standby_lock’ was here
} xl_standby_lock;
^
metadata.c:22:33: fatal error: access/htup_details.h: No such file or directory
#include "access/htup_details.h"
^
compilation terminated.
make: *** [metadata.o] Error 1
Now I'm stuck here.
Please let me know if I'm missing something.
If the primary has been created but pg_autoctl is not running, then pg_autoctl create
on the standby will get stuck at:
03:33:33 INFO FSM transition from "init" to "wait_standby": Start following a primary
03:33:33 INFO Transition complete: current state is now "wait_standby"
It would be nice if we gave a hint to the user that pg_autoctl run
needs to be running at the primary at this point.
I am trying to upgrade postgresql to postgresql-12 , but from the citusdata apt repository , there is only postgresql-11-auto-failover-1.0 , and with the package , postgresql-11 will be installed.
so there are two versions of postggresql in one node
2019-10-23 07:01:07 [administrator@pg_primary ~]$ sudo apt list --installed|grep postgresql
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
postgresql-11/bionic-pgdg,now 11.5-3.pgdg18.04+1 amd64 [installed,automatic]
postgresql-11-auto-failover-1.0/bionic,now 1.0.5 amd64 [installed]
postgresql-12/bionic-pgdg,now 12.0-2.pgdg18.04+1 amd64 [installed]
postgresql-client-11/bionic-pgdg,now 11.5-3.pgdg18.04+1 amd64 [installed,automatic]
postgresql-client-12/bionic-pgdg,now 12.0-2.pgdg18.04+1 amd64 [installed]
postgresql-client-common/bionic-pgdg,now 207.pgdg18.04+1 all [installed,automatic]
postgresql-common/bionic-pgdg,now 207.pgdg18.04+1 all [installed,automatic]
even I run the postgresql only in version 12.
2019-10-23 07:08:59 [administrator@pg_primary ~]$ sudo runuser -l postgres -c "psql postgres://pg_secondary:5432,pg_primary:5432/testdb1?target_session_attrs=read-write -c 'show server_version;'"
server_version
----------------------------------
12.0 (Ubuntu 12.0-2.pgdg18.04+1)
(1 row)
hope the offical packages will been released soon
I got following warnings when building man pages for 1.0.5 release
4 /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/index.rst:7: WARNING: Title underline too short.
4 /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/install.rst:115: WARNING: Error in "code-block" directive:
4 /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/operations.rst:87: WARNING: Unexpected indentation.
3 /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/operations.rst:93: WARNING: Block quote ends without a blank line; unexpected unindent.
4 /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/quickstart.rst:20: WARNING: Title underline too short.
1 looking for now-outdated files... /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/operations.rst:93: WARNING: Block quote ends without a blank line; unexpected unindent.
4 pg_autoctl.1 { } /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/ref/reference.rst:356: WARNING: undefined label: _pg_autoctl_create_monitor (if the link has no caption the label must precede a section header)
4 writing... pg_auto_failover.1 { intro quickstart quickstart-cluster architecture fault-tolerance install ref/reference ref/configuration operations fsm } /citus-rpm-build/pg-auto-failover-enterprise-1.0.5/docs/ref/reference.rst:356: WARNING: undefined label: _pg_autoctl_create_monitor (if the link has no caption the label must precede a section header)
Logging it here to be fixed for the next version
[postgres@test1 ~]$ pg_autoctl create monitor --pgdata ./pgdata
10:59:54 INFO Initialising a PostgreSQL cluster at "./pgdata"
10:59:54 INFO Now using absolute pgdata value "/home/postgres/pgdata" in the configuration
10:59:54 INFO /usr/pgsql-11/bin/pg_ctl --pgdata ./pgdata --options "-p 5432" --options "-h *" --wait start
10:59:54 ERROR Failed to parse pg_controldata output
10:59:54 FATAL Failed to discover PostgreSQL setup, please fix previous errors.
10:59:54 FATAL Failed to initialise a monitor node, see above for details
I'm running the monitor and the master in two Docker container on a container orchestrator. The first deployment works great. The master registers itself on the monitor node. But as soon as I kill the master node container and start it again, it can't register itself anymore and gives me this error.
16:47:05 ERROR Failed to execute "SELECT * FROM pgautofailover.register_node($1, $2, $3, $4, $5, $6::pgautofailover.replication_state, $7)": ERROR: duplicate key value violates unique constraint "node_nodename_nodeport_key"
DETAIL: Key (nodename, nodeport)=(_mymasternodename_, 5432) already exists.
CONTEXT: SQL statement "INSERT INTO pgautofailover.node (formationid, groupid, nodename, nodeport, goalstate, reportedstate) VALUES ($1, $2, $3, $4, $5, $6) RETURNING nodeid"
I start my node with the following command :
su postgres -c 'pg_autoctl create postgres --pgctl /usr/lib/postgresql/11/bin/pg_ctl --pgdata /home/postgres/data --monitor postgres://autoctl_node@_mymonitornodename_:5432/pg_auto_failover --nodename _mymasternodename_'
When executing create postgres on secondary node it fails due to mounted PGDATA folder.
ERROR Failed to remove directory "/opt/postgressql/pgdbdata": Device or resource busy
Full log:
ubuntu@ip-10-61-5-136:~$ sudo -u postgres -i pg_autoctl create postgres --pgdata /opt/postgressql/pgdbdata --monitor postgres://autoctl_node@MONITOR_NODE:5432/pg_auto_failover --nodename SECONDARY_NODE
15:13:01 INFO Registered node SECONDARY_NODE with id 16 in formation "default", group 0.
15:13:01 INFO Writing keeper init state file at "/var/lib/postgresql/.local/share/pg_autoctl/opt/postgressql/pgdbdata/pg_autoctl.init"
15:13:01 INFO Successfully registered as "wait_standby" to the monitor.
15:13:01 INFO FSM transition from "init" to "wait_standby": Start following a primary
15:13:01 INFO Transition complete: current state is now "wait_standby"
15:13:06 INFO FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
15:13:06 INFO The primary node returned by the monitor is PRIMARY_NODE:5432
15:13:06 INFO Initialising PostgreSQL as a hot standby
15:13:06 INFO Target directory exists: "/opt/postgressql/pgdbdata", stopping PostgreSQL
15:13:06 INFO pg_ctl: no server running
15:13:06 INFO pg_ctl stop failed, but PostgreSQL is not running anyway
15:13:09 INFO Running /usr/lib/postgresql/11/bin/pg_basebackup -w -h PRIMARY_NODE -p 5432 --pgdata /opt/postgressql/backup -U pgautofailover_replicator --write-recovery-conf --max-rate 100M --wal-method=stream --slot pgautofailover_standby ...
15:18:27 INFO pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: CB/6D000028 on timeline 3
pg_basebackup: starting background WAL receiver
0/30758969 kB (0%), 0/1 tablespace (.../postgressql/backup/backup_label)
....
30759029/30759029 kB (100%), 0/1 tablespace (...gressql/backup/global/pg_control)
30759029/30759029 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: CB/6D01E8C8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed
could not remove file or directory "/opt/postgressql/pgdbdata": Device or resource busy
15:18:27 ERROR Failed to remove directory "/opt/postgressql/pgdbdata": Device or resource busy
15:18:27 ERROR Failed initialise standby server, see above for details
15:18:27 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.
after performing manual failover multiple times observed that pg_replication_slots
table has an entry with active = f
ideally we should not have any replication slots entry in the secondary node.
I already build a cluster with 3 nodes , and I want to simulate that one of the nodes were permanently failure and I want to join a new node to the cluster to replace the failure one.
so I shutdown one of the nodes , create a new node , and run the command
sudo runuser -l postgres -c "pg_autoctl create postgres --pgdata /var/lib/postgresql/11/main --pgport 5432 --nodename pg-third --monitor postgres://autoctl_node@monitor:5432/pg_auto_failover"
here comes the error
14:38:03 ERROR Failed to execute "SELECT * FROM pgautofailover.register_node($1, $2, $3, $4, $5, $6::pgautofailover.replication_state, $7)": ERROR: group 0 already has 2 members
14:38:03 ERROR Failed to register node pg-third:5432 in group 0 of formation "default" with initial state "init", see previous lines for details
14:38:03 ERROR Failed to register the existing local Postgres node "pg-third:5432" running at "/var/lib/postgresql/11/main"to the pg_auto_failover monitor at postgres://autoctl_node@monitor:5432/pg_auto_failover, see above for details
maybe the process should be remove a node using pg_autoctl drop node first , then join the new node , but that command can only drop the host self, not able to drop other node in cluster . BUT , the node need to be drop was already dead and unrecoverable , then how to drop the dead node and join a new node into the cluster ?
We are getting packaging warnings about missing man page.
We should consider adding it.
Hi,
I just followed the README to setup the three nodes with PostgreSQL 12:
postgres@pg-af1:/u02/pgdata/12/PG1/ [PG1] pg_autoctl show state
Name | Port | Group | Node | Current State | Assigned State
---------------------------+--------+-------+-------+-------------------+------------------
pg-af1.it.dbi-services.com | 5432 | 0 | 1 | wait_primary | wait_primary
pg-af2.it.dbi-services.com | 5432 | 0 | 2 | catchingup | catchingup
What does the "catchingup" state really mean? I'd expected something like "in sync". When checking synchronous_standby_names I'd expected to see synchronous replication which apparently is not the case:
postgres@pg-af1:/u02/pgdata/12/PG1/ [PG1] sq
psql (12.0 dbi services build)
Type "help" for help.
postgres=# show synchronous_standby_names ;
synchronous_standby_names
---------------------------
(1 row)
Am I missing something? Replication in general is fine:
postgres=# select * from pg_stat_replication ;
-[ RECORD 1 ]----+------------------------------
pid | 8336
usesysid | 16385
usename | pgautofailover_replicator
application_name | walreceiver
client_addr | 192.168.22.71
client_hostname | pg-af2.it.dbi-services.com
client_port | 48680
backend_start | 2019-10-19 15:39:00.508099+02
backend_xmin |
state | streaming
sent_lsn | 0/30017A0
write_lsn | 0/30017A0
flush_lsn | 0/30017A0
replay_lsn | 0/30017A0
write_lag |
flush_lag |
replay_lag |
sync_priority | 0
sync_state | async
reply_time | 2019-10-19 15:49:16.594248+02
Thanks
Daniel
These instructions assume you have $VERSION
, $PROJECT
, and $REPO
environment variables set in your shell (e.g. 1.0.0
, pgautofailover
, and pg_auto_failover
). With those set, code from most steps can be copy-pasted.
release-x.y
branch. git log --cherry-pick --no-merges release-x.y...master
can be helpful. Be sure to cherry-pick changes in the same order they were merged to the main branch (but do not cherry-pick merge commits themselves)CHANGELOG
entry in the master
release-x.y
branch summarizing meaningful changes ( will merge release branch into master after releasing to update the changelog there )defaults.h
git tag -a -s v$VERSION
to create an annotated, signed tag for the release. Summarize the release in the one-line tag annotation (beneath 52 characters). Push the tag with git push origin v$VERSION
open https://github.com/citusdata/$REPO/releases
)
v$VERSION
). Leave the description blank (it will auto-fill with the tag description)As for signing tags, use your own key, and ensure it's known to GitHub.
debian-$PROJECT
branch of the packaging repository; create a new branch for your changes
$VERSION
, stable
) to the debian/changelog
filepkglatest
variable in the pkgvars
file to $VERSION
citus_package -p=debian/jessie -p=debian/stretch -p=ubuntu/bionic -p=ubuntu/xenial -p=ubuntu/trusty local release-$VERSION 2>&1 | tee citus_package.log
citus_package -p=debian/jessie -p=debian/stretch -p=ubuntu/bionic -p=ubuntu/xenial -p=ubuntu/trusty local release 2>&1 | tee citus_package.log
grep -Ei '(warning|\bi|\be|\bw):' citus_package.log | sort | uniq -c
. Ignore any warnings about using a gain-root-command while being root or Recognised distributionsredhat-$PROJECT
branch of the packaging repository; create a new branch for your changes
pkglatest
variable in the pkgvars
file to $VERSION
$PROJECT.spec
file, being sure to:
Version:
fieldSource0:
field$VERSION
) to the %changelog
sectioncitus_package -p=el/7 -p=el/6 -p=ol/7 -p=ol/6 local release-$VERSION 2>&1 | tee citus_package.log
citus_package -p=el/7 -p=el/6 -p=ol/7 -p=ol/6 local release 2>&1 | tee citus_package.log
grep -Ei '(warning|\bi|\be|\bw):' citus_package.log | sort | uniq -c | grep -v "\.c"
. Ignore any errors about --disable-dependency-tracking
curl https://install.citusdata.com/community/deb.sh | sudo bash
Encouraging people to blindly execute content from the internet as root is as dangerous as it is convenient - on both security grounds and failure modes. Please provide alternate default instructions to add your repos.
(I hasten to add that this project looks awesome and I’ll be looking to see if we can drop some homegrown equivalents - thank you for opening this!)
Hi,
pg_auto_failover cannot be built against PostgreSQL 12:
+ PG_CONFIG=/usr/pgsql-12/bin/pg_config
+ /usr/bin/make -j3
ipaddr.c: In function 'findHostnameFromLocalIpAddress':
ipaddr.c:548:12: warning: unused variable 'address_len' [-Wunused-variable]
socklen_t address_len;
^~~~~~~~~~~
ipaddr.c:547:26: warning: unused variable 'address' [-Wunused-variable]
struct sockaddr_storage address;
^~~~~~~
health_check_worker.c: In function 'HealthCheckWorkerLauncherMain':
health_check_worker.c:233:20: error: implicit declaration of function 'AllocSetContextCreateExtended'; did you mean 'AllocSetContextCreateInternal'? [-Werror=implicit-function-declaration]
launcherContext = AllocSetContextCreateExtended(CurrentMemoryContext,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
AllocSetContextCreateInternal
health_check_worker.c:233:18: error: assignment to 'MemoryContext' {aka 'struct MemoryContextData *'} from 'int' makes pointer from integer without a cast [-Werror=int-conversion]
launcherContext = AllocSetContextCreateExtended(CurrentMemoryContext,
^
health_check_worker.c: In function 'BuildDatabaseList':
health_check_worker.c:363:9: error: implicit declaration of function 'heap_beginscan_catalog'; did you mean 'table_beginscan_catalog'? [-Werror=implicit-function-declaration]
scan = heap_beginscan_catalog(pgDatabaseRelation, 0, NULL);
^~~~~~~~~~~~~~~~~~~~~~
table_beginscan_catalog
health_check_worker.c:363:7: error: assignment to 'HeapScanDesc' {aka 'struct HeapScanDescData *'} from 'int' makes pointer from integer without a cast [-Werror=int-conversion]
scan = heap_beginscan_catalog(pgDatabaseRelation, 0, NULL);
^
In file included from /usr/pgsql-12/include/server/postgres.h:46,
from health_check_worker.c:13:
health_check_worker.c:365:49: error: passing argument 1 of 'heap_getnext' from incompatible pointer type [-Werror=incompatible-pointer-types]
while (HeapTupleIsValid(dbTuple = heap_getnext(scan, ForwardScanDirection)))
^~~~
/usr/pgsql-12/include/server/c.h:626:48: note: in definition of macro 'PointerIsValid'
#define PointerIsValid(pointer) ((const void*)(pointer) != NULL)
^~~~~~~
health_check_worker.c:365:9: note: in expansion of macro 'HeapTupleIsValid'
while (HeapTupleIsValid(dbTuple = heap_getnext(scan, ForwardScanDirection)))
^~~~~~~~~~~~~~~~
In file included from health_check_worker.c:21:
/usr/pgsql-12/include/server/access/heapam.h:120:45: note: expected 'TableScanDesc' {aka 'struct TableScanDescData *'} but argument is of type 'HeapScanDesc' {aka 'struct HeapScanDescData *'}
extern HeapTuple heap_getnext(TableScanDesc scan, ScanDirection direction);
~~~~~~~~~~~~~~^~~~
cli_config.c: In function 'cli_config_check':
cli_config.c:82:6: warning: unused variable 'errors' [-Wunused-variable]
int errors = 0;
^~~~~~
health_check_worker.c:378:19: error: implicit declaration of function 'HeapTupleGetOid'; did you mean 'HeapTupleIsValid'? [-Werror=implicit-function-declaration]
entry->dboid = HeapTupleGetOid(dbTuple);
^~~~~~~~~~~~~~~
HeapTupleIsValid
health_check_worker.c:387:15: error: passing argument 1 of 'heap_endscan' from incompatible pointer type [-Werror=incompatible-pointer-types]
heap_endscan(scan);
^~~~
In file included from health_check_worker.c:21:
/usr/pgsql-12/include/server/access/heapam.h:119:40: note: expected 'TableScanDesc' {aka 'struct TableScanDescData *'} but argument is of type 'HeapScanDesc' {aka 'struct HeapScanDescData *'}
extern void heap_endscan(TableScanDesc scan);
~~~~~~~~~~~~~~^~~~
health_check_worker.c: In function 'HealthCheckWorkerMain':
health_check_worker.c:452:21: error: assignment to 'MemoryContext' {aka 'struct MemoryContextData *'} from 'int' makes pointer from integer without a cast [-Werror=int-conversion]
healthCheckContext = AllocSetContextCreateExtended(CurrentMemoryContext,
^
cli_service.c: In function 'cli_monitor_run':
cli_service.c:161:16: warning: unused variable 'existingPgSetup' [-Wunused-variable]
PostgresSetup existingPgSetup = { 0 };
^~~~~~~~~~~~~~~
cli_service.c:155:7: warning: unused variable 'configFilePath' [-Wunused-variable]
char configFilePath[MAXPGPATH];
^~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[2]: *** [<builtin>: health_check_worker.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [Makefile:16: monitor] Error 2
make[1]: *** Waiting for unfinished jobs....
cli_show.c: In function 'cli_show_uri':
cli_show.c:309:10: warning: unused variable 'monitor' [-Wunused-variable]
Monitor monitor = { 0 };
^~~~~~~
cli_create_drop_node.c: In function 'cli_create_monitor':
cli_create_drop_node.c:427:7: warning: unused variable 'ipAddr' [-Wunused-variable]
char ipAddr[BUFSIZE];
^~~~~~
cli_create_drop_node.c:425:16: warning: unused variable 'pgSetup' [-Wunused-variable]
PostgresSetup pgSetup = config.pgSetup;
^~~~~~~
cli_root.c: In function 'root_options':
cli_root.c:114:6: warning: unused variable 'debugCount' [-Wunused-variable]
int debugCount = 0;
^~~~~~~~~~
pgsetup.c: In function 'get_pgpid':
pgsetup.c:403:7: warning: unused variable 'line' [-Wunused-variable]
char line[BUFSIZE];
^~~~
pgsetup.c:402:6: warning: unused variable 'lineno' [-Wunused-variable]
int lineno;
^~~~~~
cli_do_root.c: In function 'keeper_cli_keeper_setup_getopts':
cli_do_root.c:231:23: warning: unused variable 'errors' [-Wunused-variable]
int c, option_index, errors = 0;
^~~~~~
cli_do_root.c:231:9: warning: unused variable 'option_index' [-Wunused-variable]
int c, option_index, errors = 0;
^~~~~~~~~~~~
cli_do_root.c:231:6: warning: unused variable 'c' [-Wunused-variable]
int c, option_index, errors = 0;
^
monitor_config.c: In function 'monitor_config_set_setting':
monitor_config.c:442:13: warning: unused variable 'option' [-Wunused-variable]
IniOption *option;
^~~~~~
monitor_pg_init.c: In function 'monitor_pg_init':
monitor_pg_init.c:60:7: warning: unused variable 'postgresInstanceExists' [-Wunused-variable]
bool postgresInstanceExists = pg_setup_pgdata_exists(&pgSetup);
^~~~~~~~~~~~~~~~~~~~~~
Can you please take a look?
Thanks, Devrim
WaitForEvent
function heavily dependant on poll.h
include, but in all other places, PostgreSQL uses such include tecnhique:
#ifdef HAVE_POLL_H
#include <poll.h>
#endif
If we do define include in such a manner we got:
pasha@PG480 MINGW64 /src/pg_auto_failover
$ make
make -C src/monitor/ all
make[1]: Entering directory '/src/pg_auto_failover/src/monitor'
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -O2 -std=c99 -Wall -Werror -Wno-unused-parameter -Iinclude -IC:/Users/pasha/Code/POSTGR~1/msys64/usr/local/POSTGR~1/include -I. -I./ -IC:/Users/pasha/Code/POSTGR~1/msys64/usr/local/POSTGR~1/include/server -IC:/Users/pasha/Code/POSTGR~1/msys64/usr/local/POSTGR~1/include/internal -I./src/include/port/win32 -DEXEC_BACKEND -IC:/Users/pasha/Code/POSTGR~1/msys64/usr/local/POSTGR~1/include/server/port/win32 -DWIN32_STACK_RLIMIT=4194304 -c -o health_check_worker.o health_check_worker.c
health_check_worker.c: In function 'HealthCheckWorkerLauncherMain':
health_check_worker.c:292:44: error: passing argument 2 of 'WaitForBackgroundWorkerStartup' from incompatible pointer type [-Werror=incompatible-pointer-types]
292 | WaitForBackgroundWorkerStartup(handle, &pid);
| ^~~~
| |
| int *
In file included from health_check_worker.c:28:
C:/Users/pasha/Code/POSTGR~1/msys64/usr/local/POSTGR~1/include/server/postmaster/bgworker.h:123:94: note: expected 'pid_t *' {aka 'long long int *'} but argument is of type 'int *'
123 | extern BgwHandleStatus WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle, pid_t *pid);
| ~~~~~~~^~~
health_check_worker.c: In function 'WaitForEvent':
health_check_worker.c:637:64: error: invalid application of 'sizeof' to incomplete type 'struct pollfd'
637 | pollFDs = (struct pollfd *) palloc0(healthCheckCount * sizeof(struct pollfd));
| ^~~~~~
health_check_worker.c:644:47: error: invalid use of undefined type 'struct pollfd'
644 | struct pollfd *pollFileDescriptor = &pollFDs[healthCheckIndex];
| ^
health_check_worker.c:644:47: error: dereferencing pointer to incomplete type 'struct pollfd'
health_check_worker.c:670:21: error: 'POLLERR' undeclared (first use in this function)
670 | pollEventMask = POLLERR | POLLIN;
| ^~~~~~~
health_check_worker.c:670:21: note: each undeclared identifier is reported only once for each function it appears in
health_check_worker.c:670:31: error: 'POLLIN' undeclared (first use in this function); did you mean 'LPOLELINK'?
670 | pollEventMask = POLLERR | POLLIN;
| ^~~~~~
| LPOLELINK
health_check_worker.c:674:31: error: 'POLLOUT' undeclared (first use in this function)
674 | pollEventMask = POLLERR | POLLOUT;
| ^~~~~~~
health_check_worker.c:694:15: error: implicit declaration of function 'poll' [-Werror=implicit-function-declaration]
694 | pollResult = poll(pollFDs, healthCheckCount, pollTimeout);
| ^~~~
health_check_worker.c:706:47: error: invalid use of undefined type 'struct pollfd'
706 | struct pollfd *pollFileDescriptor = &pollFDs[healthCheckIndex];
| ^
cc1.exe: all warnings being treated as errors
make[1]: *** [<builtin>: health_check_worker.o] Error 1
make[1]: Leaving directory '/src/pg_auto_failover/src/monitor'
make: *** [Makefile:16: monitor] Error 2
I think we need to rewrite that particular piece of code to be crosscompatible
getting below error when trying to setup secondary instance :
[postgres@lxlab1569252301 postgres_log]$ pg_autoctl create postgres --pgdata /postgres_data --monitor postgres://[email protected]:50001/pg_auto_failover
14:10:38 INFO Using --nodename "lxlab1569252301.devlab.lowes.com", which resolves to IP address "172.29.142.23"
14:10:38 INFO Continuing from a previous `pg_autoctl create` failed attempt
14:10:38 INFO PostgreSQL state at registration time was: PGDATA does not exists
14:10:38 INFO FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
14:10:38 INFO The primary node returned by the monitor is lxlab1569252313.devlab.lowes.com:50001
14:10:38 INFO Initialising PostgreSQL as a hot standby
14:10:38 INFO Target directory exists: "/postgres_data", stopping PostgreSQL
14:10:38 INFO pg_ctl: no server running
14:10:38 INFO pg_ctl stop failed, but PostgreSQL is not running anyway
could not remove file or directory "//backup": Permission denied
14:10:38 ERROR Failed to remove directory "//backup": Permission denied
14:10:38 ERROR Failed initialise standby server, see above for details
14:10:38 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.
i dont know from its checking /backup folder .
successfully created monitor and primary instance.
Version is 11.2 and Os types is linux x86_64.
Followed below link for setup
https://github.com/citusdata/pg_auto_failover
Needs this quick and dirty patch to compile and link successfully:
diff --git a/src/bin/pg_autoctl/Makefile b/src/bin/pg_autoctl/Makefile
index 64b2ee7..d0d25a8 100644
--- a/src/bin/pg_autoctl/Makefile
+++ b/src/bin/pg_autoctl/Makefile
@@ -33,7 +33,7 @@ CFLAGS += $(COMMON_LIBS)
LIBS = -L $(shell $(PG_CONFIG) --libdir)
LIBS += -L $(shell $(PG_CONFIG) --pkglibdir)
-LIBS += -lpq -lpgport -lpgcommon
+LIBS += -lpq -lpgport -lpgcommon -lintl
all: $(PG_AUTOCTL) ;
diff --git a/src/bin/pg_autoctl/cli_create_drop_node.c b/src/bin/pg_autoctl/cli_create_drop_node.c
index 637dafd..e696650 100644
--- a/src/bin/pg_autoctl/cli_create_drop_node.c
+++ b/src/bin/pg_autoctl/cli_create_drop_node.c
@@ -13,6 +13,7 @@
#include <getopt.h>
#include <signal.h>
#include <string.h>
+#include <arpa/inet.h>
#include "postgres_fe.h"
diff --git a/src/bin/pg_autoctl/keeper.c b/src/bin/pg_autoctl/keeper.c
index 901be36..9cf2714 100644
--- a/src/bin/pg_autoctl/keeper.c
+++ b/src/bin/pg_autoctl/keeper.c
@@ -7,6 +7,7 @@
*
*/
#include <fcntl.h>
+#include <sys/stat.h>
#include <inttypes.h>
#include <time.h>
#include <stdlib.h>
diff --git a/src/bin/pg_autoctl/state.c b/src/bin/pg_autoctl/state.c
index 435a673..5cc87ec 100644
--- a/src/bin/pg_autoctl/state.c
+++ b/src/bin/pg_autoctl/state.c
@@ -8,6 +8,7 @@
*/
#include <fcntl.h>
+#include <sys/stat.h>
#include <inttypes.h>
#include <stdlib.h>
#include <string.h>
Detected operating system as debian/buster.
Unfortunately, your operating system distribution and version are not supported by this script.
Since it's out for a few months now, it would be great if packages were built :)
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.