patroni's Introduction

Patroni: A Template for PostgreSQL HA with ZooKeeper, etcd or Consul

You can find a version of this documentation that is searchable and also easier to navigate at patroni.readthedocs.io.

There are many ways to run high availability with PostgreSQL; for a list, see the PostgreSQL Documentation.

Patroni is a template for high availability (HA) PostgreSQL solutions using Python. For maximum accessibility, Patroni supports a variety of distributed configuration stores like ZooKeeper, etcd, Consul or Kubernetes. Database engineers, DBAs, DevOps engineers, and SREs who are looking to quickly deploy HA PostgreSQL in datacenters - or anywhere else - will hopefully find it useful.

We call Patroni a "template" because it is far from being a one-size-fits-all or plug-and-play replication system. It will have its own caveats. Use wisely.

Currently supported PostgreSQL versions: 9.3 to 16.

Note to Citus users: Starting with version 3.0, Patroni integrates nicely with the Citus database extension for Postgres. Please check the Citus support page in the Patroni documentation for more info about how to use Patroni high availability together with a Citus distributed cluster.

Note to Kubernetes users: Patroni can run natively on top of Kubernetes. Take a look at the Kubernetes chapter of the Patroni documentation.

How Patroni Works

Patroni originated as a fork of Governor, the project from Compose. It includes plenty of new features.

For an example of a Docker-based deployment with Patroni, see Spilo, currently in use at Zalando.

For additional background information, see the talks and blog posts linked from the Patroni documentation.

Development Status

Patroni is in active development and accepts contributions. See our Contributing section below for more details.

We report new releases information here.

Community

There are two places to connect with the Patroni community: on GitHub, via Issues and PRs, and on the #patroni channel in the PostgreSQL Slack. If you're using Patroni, or just interested, please join us.

Technical Requirements/Installation

Prerequisites for Mac OS

To install requirements on a Mac, run the following:

brew install postgresql etcd haproxy libyaml python

Psycopg

Starting with psycopg2-2.8, the binary version of psycopg2 is no longer installed by default. Installing it from source requires a C compiler and postgres+python dev packages. Since in the Python world it is not possible to specify a dependency as psycopg2 OR psycopg2-binary, you will have to decide how to install it.

There are a few options available:

  1. Use the package manager from your distro:

sudo apt-get install python3-psycopg2  # install psycopg2 module on Debian/Ubuntu
sudo yum install python3-psycopg2      # install psycopg2 on RedHat/Fedora/CentOS

  2. Specify one of psycopg, psycopg2, or psycopg2-binary in the list of dependencies when installing Patroni with pip (see below).

General installation for pip

Patroni can be installed with pip:

pip install patroni[dependencies]

where dependencies can be either empty, or consist of one or more of the following:

etcd or etcd3

python-etcd module in order to use Etcd as DCS

consul

python-consul module in order to use Consul as DCS

zookeeper

kazoo module in order to use Zookeeper as DCS

exhibitor

kazoo module in order to use Exhibitor as DCS (same dependencies as for Zookeeper)

kubernetes

kubernetes module in order to use Kubernetes as DCS in Patroni

raft

pysyncobj module in order to use python Raft implementation as DCS

aws

boto3 in order to use AWS callbacks

all

all of the above (except psycopg family)

psycopg3

psycopg[binary]>=3.0.0 module

psycopg2

psycopg2>=2.5.4 module

psycopg2-binary

psycopg2-binary module

For example, the command to install Patroni together with psycopg3 and the dependencies for Etcd as a DCS and AWS callbacks is:

pip install patroni[psycopg3,etcd3,aws]

Note that external tools called by the replica creation or custom bootstrap scripts (e.g. WAL-E) should be installed independently of Patroni.

Running and Configuring

To get started, do the following from different terminals:

> etcd --data-dir=data/etcd --enable-v2=true
> ./patroni.py postgres0.yml
> ./patroni.py postgres1.yml

You will then see a high-availability cluster start up. Test different settings in the YAML files to see how the cluster's behavior changes. Kill some of the components to see how the system behaves.

Add more postgres*.yml files to create an even larger cluster.
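Each additional file needs only a unique member name, data directory, and non-conflicting ports. A minimal sketch of a hypothetical postgres2.yml, with illustrative values patterned after the bundled postgres0.yml:

scope: batman
name: postgresql2

restapi:
  listen: 127.0.0.1:8010
  connect_address: 127.0.0.1:8010

postgresql:
  listen: 127.0.0.1:5434
  connect_address: 127.0.0.1:5434
  data_dir: data/postgresql2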

Patroni provides an HAProxy configuration, which will give your application a single endpoint for connecting to the cluster's leader. To configure, run:

> haproxy -f haproxy.cfg
> psql --host 127.0.0.1 --port 5000 postgres

YAML Configuration

See the Patroni documentation for comprehensive information about settings for etcd, consul, and ZooKeeper; for an example, see postgres0.yml.

Environment Configuration

The Patroni documentation also covers configuring (overriding) settings via environment variables.

Replication Choices

Patroni uses Postgres' streaming replication, which is asynchronous by default. Patroni's asynchronous replication configuration allows a maximum_lag_on_failover setting, which ensures that a failover will not occur if a follower is more than a certain number of bytes behind the leader. This setting should be increased or decreased based on business requirements. It's also possible to use synchronous replication for better durability guarantees; see the replication modes documentation for details.
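As a hedged illustration, both knobs live in the cluster's dynamic configuration; the key names below follow the current Patroni documentation, and the values are examples only:

maximum_lag_on_failover: 1048576  # bytes; a follower further behind is not promoted
synchronous_mode: true            # trade some write latency for durability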

Applications Should Not Use Superusers

When connecting from an application, always use a non-superuser. Patroni requires access to the database to function properly. By using a superuser from an application, you can potentially exhaust the entire connection pool, including the connections reserved for superusers by the superuser_reserved_connections setting. If Patroni cannot access the primary because the connection pool is full, its behavior will be undesirable.

patroni's Issues

move restore back to core libraries

I don't think moving restore.py to an outside script was, on the whole, a net benefit. While I understand not wanting to burden the main code with platform-specific dependencies, it puts us in the odd position of needing a full file path in the config for something which is a critical function of Patroni. Also, if separating out the WAL-E content was your goal, naming the script "restore.py" wasn't the way to do it.

I suggest instead that we have a core function of patroni called restore, which is pg_basebackup-only restore. Other restore scripts would be called only if they are defined in config.yml, and only for alternate mechanisms (e.g. wal-e.py, barman.py).

Support socket connections for local patroni instance

Currently, the local patroni instance can only connect using TCP. We should support local unix socket connections as well; that's both faster and more secure.

This will require some replumbing, which I regard as a cleanup anyway. Currently, we read config information, use that to create a URL, then parse that URL to get the config information back. This seems unnecessarily ornate; we should either use URLs in the config in the first place, or we should avoid this whole unnecessary step and use the config fields directly. For backwards-compatibility reasons, I plan to use the latter path.

I'll also use a file path for the socket interface if that's preferred by the user. It should be simple to differentiate this from an IPv4/v6 address.

I plan to do this in the next couple of days, so let me know if you have comments on the design.

Run is_healthiest_node method against the last known healthy configuration of the cluster

In theory it could happen that all information about the cluster topology is wiped from the configuration store. In this case patroni will immediately start a new election process, but at that moment the list of cluster members would be empty or consist of just one member. Basically this member will consider itself the healthiest node and promote. This could lead to a split-brain situation, because it takes the real leader some time to figure out that it lost the leader lock.

We can overcome such an issue by caching the previous cluster configuration (from when the cluster was healthy) and running is_healthiest_node against that cached configuration.
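A minimal sketch of that idea, as methods on the HA object, assuming hypothetical attributes and helpers (self._last_healthy_cluster, cluster.is_healthy()) that are not in the actual codebase:

# Hypothetical sketch: remember the last topology seen while the cluster
# was healthy, and fall back to it when the DCS view looks suspiciously empty.
def update_cluster_view(self, cluster):
    if cluster.is_healthy():  # hypothetical predicate
        self._last_healthy_cluster = cluster

def members_for_election(self, cluster):
    # An empty or single-member view right after a DCS wipe is suspect.
    if len(cluster.members) <= 1 and self._last_healthy_cluster:
        return self._last_healthy_cluster.members
    return cluster.members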

Superfluous error on starting replication?

This is on the movebasebackup branch, but does not appear to be caused by any of the code there.

2015-10-27 17:51:05,351 INFO: Starting new HTTP connection (1): 172.17.0.5
2015-10-27 17:51:05,357 INFO: trying to bootstrap from leader
waiting for server to start....LOG: database system was interrupted; last known up at 2015-10-27 17:51:05 PDT
LOG: entering standby mode
LOG: redo starts at 0/2000028
LOG: consistent recovery state reached at 0/20000F0
LOG: database system is ready to accept read only connections
FATAL: could not start WAL streaming: ERROR: replication slot "postgresql1" does not exist

done
server started
2015-10-27 17:51:07,240 INFO: bootstrapped from leader

The strange thing about that error is that the slot exists, the replica is replicating, and everything is working. Seems like we could avoid having the error sent to the output by checking pg_replication_slots on the master. Not sure what code is giving rise to this, though.
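The suggested check could be as small as a query against pg_replication_slots (a real catalog view) on the leader before streaming starts; the helper below is a hypothetical sketch, not code from the repo:

import psycopg2

def slot_exists(leader_conninfo, slot_name):
    # Sketch: confirm the physical slot is visible on the leader
    # before the replica attempts to stream from it.
    with psycopg2.connect(leader_conninfo) as conn:
        with conn.cursor() as cur:
            cur.execute('SELECT 1 FROM pg_replication_slots WHERE slot_name = %s',
                        (slot_name,))
            return cur.fetchone() is not None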

Add tags to replicas to support many replicas

Replicas should have the following attributes available. This would be added as tags within the replica description, both in the API and in the SIS. If the tags are absent, default behavior remains the same; in the config, the tags below are set to True or False:

nofailover: this replica should not be a candidate for automatic failover. Used for ephemeral replicas, and multi-replica setups where there are a few designated failover targets. If the tag is missing, the replica is assumed to be a failover target.

noloadbalance: this replica should not be part of any load-balancing, if HAProxy autoconfig is being used.

clonefrom: take basebackups from this replica instead of from the master. If multiple clonefrom replicas are available, one is randomly chosen for new basebackups. If cloning using WAL-E or another archival backup, this tag is ignored.

replicatefrom: not true or false, but the IP address/hostname of another replica. Used to support cascading replication.
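For reference, tags of this shape did land in Patroni's member configuration; a sketch with illustrative values (key names per the current documentation):

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: true
  replicatefrom: 172.17.0.94  # hostname/IP of another replica, for cascading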

Complete WAL-E support

There's some configuration information for WAL-E integration. However, there doesn't seem to be any code for it. What's intended here other than just writing to pg.conf?

Add simple HTTP check for "is master"

We need a simple HTTP check for HAProxy load balancing (and others) which checks if the server whose API is queried is currently the master, without parsing JSON, using a URL like /ismaster/. This will allow using http-check expect in HAProxy to determine the master.

Make sense?
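Assuming the proposed /ismaster/ endpoint returned 200 only on the master, the HAProxy side could look like this hypothetical snippet (option httpchk and http-check expect are real HAProxy directives; the endpoint, addresses, and ports are illustrative):

listen postgres
    bind *:5000
    option httpchk GET /ismaster/
    http-check expect status 200
    server node1 172.17.0.94:5432 check port 8008
    server node2 172.17.0.98:5432 check port 8008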

Implement ENV secrets for PostgreSQL user accounts

We need a better way to set passwords for the three PostgreSQL accounts we need to use than keeping them in a configuration file which may be in a poorly secured CMS repo. I suggest that we use envdir, as several other Python projects (like WAL-E) do.

Other ideas?
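With envdir (from daemontools), each secret lives in its own file and is exported as an environment variable for the child process; the directory layout and variable name below are illustrative:

$ mkdir -p /etc/patroni/env
$ echo 'super-secret' > /etc/patroni/env/PATRONI_SUPERUSER_PASSWORD
$ envdir /etc/patroni/env ./patroni.py postgres0.yml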

Support bootstrapping spilo cluster from the external PostgreSQL database

There is a valid use case of moving already running databases to spilo without doing an expensive dump/restore. In order to support it, we can just pretend that an external PostgreSQL data service is part of the spilo cluster, acting as a leader, with TTL that never expires. Once the cluster is started, spilo nodes will choose replica roles (as the leader is already taken) and will continue to act as replicas until the leader key is explicitly removed by spilo (since it cannot expire on its own).

In order to support this use case, the following parameters should be changed:

  • a new primary_conninfo parameter should be added to the yaml configuration for Patroni.
  • a global 'standby_mode' flag should be added to the DCS at /service/cluster_name/, indicating that the cluster contains a non-Patroni-managed leader while all the spilo nodes are running as replicas. This flag would change some operations, i.e. replication slots creation/removal.

From the user point-of-view, the spilo cluster should be configured with the primary_conninfo, pointing to the external PostgreSQL database.
This connection string should also allow normal (non-replication) superuser connections, and the pg_hba.conf on the cluster should be configured to permit such connections from the addresses that would belong to the spilo cluster.

The changes required in the spilo workflow are the following:

  • When the first patroni node joins the new (non-initialized cluster), it would set the following keys right after acquiring the initialize one:
    • /service/cluster_name/member/external_$random_suffix with the connection string to the external service (primary_conninfo parameter)
    • /service/cluster_name/leader with the value "external_$random_suffix"
    • /service/cluster_name/leader/optime with the value 0 (since the leader is not supposed to update the key, but it's still required for some code inside Patroni)
    • /service/cluster_name/standby_mode to signal that the cluster is running in a special mode.
  • When the other spilo nodes join the DCS and observe the standby_mode flag, they start sleeping, periodically checking until this flag is gone. This would prevent cloning an external source multiple times.
  • The active patroni will connect to the master and create a new replication slot. Afterwards, it will clone the data from the master and will run the streaming replication.
  • At the point the DBA decides the cluster should be separated from the external PostgreSQL node, they would remove the leader key and the member entry corresponding to that key, also cleaning up the replication slot created in the external PostgreSQL node. At this point, the active spilo cluster will come up, taking the leader role (unconditionally, by observing the standby_mode flag), and remove the standby_mode flag, so that other spilo members can start building themselves from the leader. Note that an attempt to remove that key before the active spilo replica becomes available should be ignored.
    That command should be built into the spilo command-line client, as it's not possible to find the promotion point automatically (for instance, we might stream current changes from the external master to spilo until a certain scheduled maintenance window, when the spilo cluster will be set as a master).

If the active spilo node dies while the standby_mode is still set, another node should notice it and take its place.

Patroni keeps the leader in read-only mode after the PostgreSQL crash

Steps to reproduce

  • run patroni cluster with a master and a replica
  • kill the master with -9
  • wait until the master recovers:

waiting for server to start....LOG: database system was interrupted; last known up at 2015-10-13 11:27:23 CEST
WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
HINT: The database server will regularly poll the pg_xlog subdirectory to check for files placed there.
LOG: entering standby mode
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/40001D8
LOG: invalid record length at 0/40002B8
LOG: consistent recovery state reached at 0/40002B8
LOG: database system is ready to accept read only connections
done
server started
2015-10-13 11:28:24,422 INFO: started as readonly because i had the session lock
2015-10-13 11:28:24,423 INFO: Lock owner: postgresql0; I am postgresql0
2015-10-13 11:28:24,428 INFO: no action. i am the leader with the lock

  • try connecting to the master and running pg_is_in_recovery(); the master will stay in read-only mode indefinitely.

I think we are not clearing the role flag after the crash; will look into it today.

Patroni fails when Cluster name is only numeric

Scenario: Created Spilo with version 18

Oct 21 09:21:11 ip-172-31-173-176 docker/88c787eafc06[785]: TypeError: cannot concatenate 'str' and 'int' objects
Oct 21 09:21:36 ip-172-31-173-176 docker/88c787eafc06[785]: Traceback (most recent call last):
Oct 21 09:21:36 ip-172-31-173-176 docker/88c787eafc06[785]:   File "/home/postgres/patroni/patroni.py", line 6, in <module>
Oct 21 09:21:36 ip-172-31-173-176 docker/88c787eafc06[785]:     main()
Oct 21 09:21:36 ip-172-31-173-176 docker/88c787eafc06[785]:   File "/home/postgres/patroni/patroni/__init__.py", line 67, in main
Oct 21 09:21:36 ip-172-31-173-176 docker/88c787eafc06[785]:     patroni = Patroni(config)
Oct 21 09:21:36 ip-172-31-173-176 docker/88c787eafc06[785]:   File "/home/postgres/patroni/patroni/__init__.py", line 22, in __init__
Oct 21 09:21:36 ip-172-31-173-176 docker/88c787eafc06[785]:     self.dcs = self.get_dcs(self.postgresql.name, config)
Oct 21 09:21:36 ip-172-31-173-176 docker/88c787eafc06[785]:   File "/home/postgres/patroni/patroni/__init__.py", line 31, in get_dcs
Oct 21 09:21:36 ip-172-31-173-176 docker/88c787eafc06[785]:     return Etcd(name, config['etcd'])
Oct 21 09:21:36 ip-172-31-173-176 docker/88c787eafc06[785]:   File "/home/postgres/patroni/patroni/etcd.py", line 148, in __init__
Oct 21 09:21:36 ip-172-31-173-176 docker/88c787eafc06[785]:     super(Etcd, self).__init__(name, config)
Oct 21 09:21:36 ip-172-31-173-176 docker/88c787eafc06[785]:   File "/home/postgres/patroni/patroni/dcs.py", line 126, in __init__
Oct 21 09:21:36 ip-172-31-173-176 docker/88c787eafc06[785]:     self._base_path = '/service/' + self._scope
Oct 21 09:21:36 ip-172-31-173-176 docker/88c787eafc06[785]: TypeError: cannot concatenate 'str' and 'int' objects
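The trace points at dcs.py concatenating a string with the scope parsed from YAML, where a purely numeric cluster name is parsed as an int; a one-line sketch of the likely fix (coercing the scope to a string):

self._base_path = '/service/' + str(self._scope)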

implement patroni-cli

In order to implement a number of features for patroni, among them issue #11, we need an actual UI for dynamic administration of Patroni systems. I anticipate having both a CLI and a web UI. I plan to start with the CLI.

The CLI will be a python program which installs with Patroni. It will have the following spec:

  • must install all patroni libraries
  • have a configuration file, patroni-cli.yml, which contains connection information to the DCS
  • might be a standard CLI, or might be a full-on command shell. Opinions?

Call: patroni-cli --config config-file --cluster clustername command [options]

--config is not optional, and contains connection information for the DCS, as well as potentially auth information for the APIs. Possibly it should be a pure positional argument.

--cluster is optional, and is there mainly for shops which have multiple clusters. For teams with only one cluster, the clustername could be contained in the config file instead. The command line parameter supersedes clustername in the config file.

Initially, I expect to implement the following commands:

status: prints out cluster status

status $node: prints detailed information on the status of a particular node

list: lists all nodes

shutdown $node: shuts down the named node

restart $node: makes node restart

failover: initiates failover by having the master release the key and reboot read-only

failover $node: initiates manual failover per above, but to the named node

disable $ttl: disables auto-failover for the whole cluster, with optional TTL

resume: resumes auto-failover for the cluster

initialize $node: makes node re-initialize itself

Allow a cluster member to bootstrap in the absence of the leader

Previously, we could only use pg_basebackup to bootstrap a new replica, which obviously required a running master. Currently, we can also use WAL-E, as well as user-supplied replica creation methods that don't require a running master, since the data is stored on S3 or other external storage. We need to change the Patroni behavior to bootstrap even if the leader is not present, in case the replica creation method allows it. Since there is no way we can auto-detect the ability of a replica creation method to create replicas without a running master, we should add a parameter to the method configuration indicating that.

The result is that we can recover from stupid errors, like the case of someone just shutting down all Patroni instances at once.
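Patroni's replica creation methods did eventually grow such a flag (no_master); a hedged configuration sketch, with the command path illustrative:

postgresql:
  create_replica_methods:
    - wal_e
    - basebackup
  wal_e:
    command: /usr/local/bin/wale_restore  # illustrative path
    no_master: 1  # this method can build a replica without a running leader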

Current HEAD fails initialization

Using the patroni-compose project (https://github.com/pgexperts/patroni-compose):

dbnode1_1 | scope: &scope testcluster
dbnode1_1 | ttl: &ttl 30
dbnode1_1 | loop_wait: &loop_wait 10
dbnode1_1 | restapi:
dbnode1_1 |   listen: 172.17.0.12:8001
dbnode1_1 |   connect_address: 172.17.0.12:8001
dbnode1_1 |   auth: 'admin:admin'
dbnode1_1 |   certfile: /etc/ssl/certs/ssl-cert-snakeoil.pem
dbnode1_1 |   keyfile: /etc/ssl/private/ssl-cert-snakeoil.key
dbnode1_1 | etcd:
dbnode1_1 |   scope: *scope
dbnode1_1 |   ttl: *ttl
dbnode1_1 |   host: etcd:2379
dbnode1_1 | postgresql:
dbnode1_1 |   name: dbnode1
dbnode1_1 |   scope: *scope
dbnode1_1 |   listen: 0.0.0.0:5432
dbnode1_1 |   connect_address: 172.17.0.12:5432
dbnode1_1 |   data_dir: /pgdata/data
dbnode1_1 |   maximum_lag_on_failover: 10485760 # 10 megabytes in bytes
dbnode1_1 |   use_slots: True
dbnode1_1 |   pgpass: /tmp/pgpass0
dbnode1_1 |   create_replica_method: basebackup
dbnode1_1 |   pg_hba:
dbnode1_1 |   - local all all trust
dbnode1_1 |   - host all all 0.0.0.0/0 md5
dbnode1_1 |   - hostssl all all 0.0.0.0/0 md5
dbnode1_1 |   replication:
dbnode1_1 |     username: replicator
dbnode1_1 |     password: replicate
dbnode1_1 |     network: 172.17.0.12/16
dbnode1_1 |   superuser:
dbnode1_1 |     username: patroni
dbnode1_1 |     password: super
dbnode1_1 |   admin:
dbnode1_1 |     username: admin
dbnode1_1 |     password: admin
dbnode1_1 |   parameters:
dbnode1_1 |     archive_mode: "off"
dbnode1_1 |     archive_command: mkdir -p ../wal_archive && cp %p ../wal_archive/%f
dbnode1_1 |     wal_level: hot_standby
dbnode1_1 |     max_wal_senders: 10
dbnode1_1 |     hot_standby: "on"
dbnode1_1 |     max_replication_slots: 7
etcd_1 | 2015-12-10 21:32:29.231848 I | raft: ce2a822cea30bfca is starting a new election at term 1
etcd_1 | 2015-12-10 21:32:29.232039 I | raft: ce2a822cea30bfca became candidate at term 2
etcd_1 | 2015-12-10 21:32:29.232325 I | raft: ce2a822cea30bfca received vote from ce2a822cea30bfca at term 2
etcd_1 | 2015-12-10 21:32:29.232492 I | raft: ce2a822cea30bfca became leader at term 2
etcd_1 | 2015-12-10 21:32:29.232682 I | raft: raft.node: ce2a822cea30bfca elected leader ce2a822cea30bfca at term 2
etcd_1 | 2015-12-10 21:32:29.233238 I | etcdserver: published {Name:default ClientURLs:[http://0.0.0.0:2379 http://0.0.0.0:4001]} to cluster 7e27652122e8b2ae
etcd_1 | 2015-12-10 21:32:29.233343 I | etcdserver: setting up the initial cluster version to 2.2
etcd_1 | 2015-12-10 21:32:29.238663 N | etcdserver: set the initial cluster version to 2.2
dbnode1_1 | 2015-12-10 21:32:29,346 INFO: Starting new HTTP connection (1): 172.17.0.11
dbnode1_1 | 2015-12-10 21:32:29,354 INFO: removing initialize key after failed attempt to initialize the cluster
dbnode1_1 | 2015-12-10 21:32:29,375 INFO: renaming data directory to /pgdata/data_2015-12-10-21-32-29
dbnode1_1 | Traceback (most recent call last):
dbnode1_1 |   File "/patroni/patroni.py", line 6, in <module>
dbnode1_1 |     main()
dbnode1_1 |   File "/patroni/patroni/__init__.py", line 73, in main
dbnode1_1 |     patroni.run()
dbnode1_1 |   File "/patroni/patroni/__init__.py", line 54, in run
dbnode1_1 |     logger.info(self.ha.run_cycle())
dbnode1_1 |   File "/patroni/patroni/ha.py", line 439, in run_cycle
dbnode1_1 |     return self._run_cycle()
dbnode1_1 |   File "/patroni/patroni/ha.py", line 405, in _run_cycle
dbnode1_1 |     return self.bootstrap() # new node
dbnode1_1 |   File "/patroni/patroni/ha.py", line 80, in bootstrap
dbnode1_1 |     self.state_handler.bootstrap()
dbnode1_1 |   File "/patroni/patroni/postgresql.py", line 672, in bootstrap
dbnode1_1 |     ret = self.initialize() and self.start()
dbnode1_1 |   File "/patroni/patroni/postgresql.py", line 201, in initialize
dbnode1_1 |     ret = subprocess.call(self._pg_ctl + ['initdb'] + ['-o', ' '.join(options)] if options else []) == 0
dbnode1_1 |   File "/usr/lib/python2.7/subprocess.py", line 522, in call
dbnode1_1 |     return Popen(*popenargs, **kwargs).wait()
dbnode1_1 |   File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
dbnode1_1 |     errread, errwrite)
dbnode1_1 |   File "/usr/lib/python2.7/subprocess.py", line 1206, in _execute_child
dbnode1_1 |     executable = args[0]
dbnode1_1 | IndexError: list index out of range
patronicompose_dbnode1_1 exited with code 1

Use the six python module for python 2 and 3 compatibility

Currently, in some places we import different modules depending on the Python version.
Most of these cases (it may be that all of them, in our case) are covered by "six". Considering that six is required by kazoo anyway, it doesn't make sense to reinvent the wheel instead of using an already existing solution.

fix name, location of pgpass file

Just a note because I won't get to it in the next couple of days: the pgpass file created by patroni is named incorrectly; it needs a leading period.

Also, this pgpass file is automatically created in the working directory from which patroni is called, which isn't necessarily the user's home directory.

Ex-master cannot come back after net-split

Hello, I am testing different failover cases with patroni and I have some issues, as per the subject.

The ex-master has a different xlog_location and cannot catch up to the new master.
Additional info:
from ZooKeeper:
ex-master:

{
    "conn_url": "postgres://replicator:[email protected]:5432/postgres",
    "api_url": "http://172.17.0.94:8019/patroni",
    "tags": {},
    "state": "running",
    "role": "replica",
    "xlog_location": 167772304
}

new-master:

{
    "conn_url": "postgres://replicator:[email protected]:5432/postgres",
    "api_url": "http://172.17.0.98:8059/patroni",
    "tags": {},
    "state": "running",
    "role": "master",
    "xlog_location": 318767632
}

And there is no record for the ex-master as a replica of the new master:
select * from pg_stat_replication; (on the new master)

6465    16384   replicator  walreceiver 172.17.0.102        47561   2015-11-17 18:34:09     streaming   0/13000060  0/13000060  0/13000060  0/13000060  1   sync
2368    16384   replicator  walreceiver 172.17.0.97     36181   2015-11-17 18:24:23     streaming   0/13000060  0/13000060  0/13000060  0/13000060  1   potential
2374    16384   replicator  walreceiver 172.17.0.95     57300   2015-11-17 18:24:23     streaming   0/13000060  0/13000060  0/13000060  0/13000060  1   potential
2373    16384   replicator  walreceiver 172.17.0.100        53910   2015-11-17 18:24:23     streaming   0/13000060  0/13000060  0/13000060  0/13000060  1   potential

And, finally, some logs from the ex-master after the net split:

2015-11-17 15:30:27,512 INFO: Zookeeper session lost, state: EXPIRED_SESSION
2015-11-17 15:30:28,227 INFO: Connecting to 172.17.0.93:2181
2015-11-17 15:30:28,230 INFO: Zookeeper connection established, state: CONNECTED
172.17.0.101 - - [17/Nov/2015 15:30:28] "OPTIONS / HTTP/1.0" 503 -
2015-11-17 15:30:28,916 INFO: Lock owner: node5; I am node1
2015-11-17 15:30:28,916 INFO: does not have lock
2015-11-17 15:30:28,926 INFO: established a new patroni connection to the postgres cluster
2015-11-17 15:30:28,928 INFO: closed patroni connection to the postgresql cluster
2015-11-17 15:30:28,996 INFO: no action.  i am a secondary and i am following a leader
2015-11-17 15:30:29,014 INFO: Lock owner: node5; I am node1
2015-11-17 15:30:29,014 INFO: changing primary_conninfo and restarting in progress
LOG:  received fast shutdown request
waiting for server to shut down...LOG:  aborting any active transactions
.FATAL:  terminating connection due to administrator command
LOG:  shutting down
LOG:  database system is shut down
 done
server stopped
waiting for server to start....LOG:  database system was shut down in recovery at 2015-11-17 15:30:29 UTC
LOG:  entering standby mode
LOG:  consistent recovery state reached at 0/A000090
LOG:  record with zero length at 0/A000090
LOG:  database system is ready to accept read only connections
LOG:  fetching timeline history file for timeline 2 from primary server
LOG:  fetching timeline history file for timeline 3 from primary server
FATAL:  could not start WAL streaming: ERROR:  requested starting point 0/A000000 on timeline 1 is not in this server's history
    DETAIL:  This server's history forked from timeline 1 at 0/9CC0688.

LOG:  new timeline 3 forked off current database system timeline 1 before current recovery point 0/A000090
2015-11-17 15:30:30,909 ERROR: get_postgresql_status
Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/patroni/api.py", line 231, in get_postgresql_status
    pg_is_in_recovery() AND pg_is_xlog_replay_paused()""", retry=retry)[0]
  File "build/bdist.linux-x86_64/egg/patroni/api.py", line 217, in query
    return self.server.query(sql, *params)
  File "build/bdist.linux-x86_64/egg/patroni/api.py", line 287, in query
    raise PostgresConnectionException('connection problems')
PostgresConnectionException: 'connection problems'
172.17.0.101 - - [17/Nov/2015 15:30:30] "OPTIONS / HTTP/1.0" 503 -
 done
server started
172.17.0.101 - - [17/Nov/2015 15:30:32] "OPTIONS / HTTP/1.0" 503 -
172.17.0.101 - - [17/Nov/2015 15:30:34] "OPTIONS / HTTP/1.0" 503 -
172.17.0.101 - - [17/Nov/2015 15:30:36] "OPTIONS / HTTP/1.0" 503 -
172.17.0.101 - - [17/Nov/2015 15:30:38] "OPTIONS / HTTP/1.0" 503 -
2015-11-17 15:30:38,997 INFO: established a new patroni connection to the postgres cluster
2015-11-17 15:30:39,009 INFO: Lock owner: node5; I am node1
2015-11-17 15:30:39,009 INFO: does not have lock
2015-11-17 15:30:39,013 INFO: no action.  i am a secondary and i am following a leader
172.17.0.101 - - [17/Nov/2015 15:30:40] "OPTIONS / HTTP/1.0" 503 -
172.17.0.101 - - [17/Nov/2015 15:30:42] "OPTIONS / HTTP/1.0" 503 -
172.17.0.101 - - [17/Nov/2015 15:30:44] "OPTIONS / HTTP/1.0" 503 -
LOG:  new timeline 3 forked off current database system timeline 1 before current recovery point 0/A000090
FATAL:  could not start WAL streaming: ERROR:  requested starting point 0/A000000 on timeline 1 is not in this server's history
    DETAIL:  This server's history forked from timeline 1 at 0/9CC0688.

add ability to have custom postgresql.conf, pg_hba.conf

Add the following lines to patroni.yml:

postgresql_config_file: /path/to/custom/file
pg_hba_file: /path/to/custom/file

If present, these configuration variables will cause the files named to be copied to the data directory after it is initialized and before PostgreSQL is started up.

A TODO for me.

Initialization key should have a TTL

  1. create a patroni node
  2. have the wrong permissions for the /pgdata dir.
  3. patroni will create a key, start initialization, and error out
  4. the initialization key will remain
  5. try to restart the node with the permissions fixed, or start a new node with the same name
  6. get: 2015-09-02 01:07:40,863 ERROR: Unexpected response: 412 {"errorCode":105,"message":"Key already exists","cause":"/service/pgcluster1/initialize","index":11}
  7. Node does not initialize

Add healthcheck to the api

To enable easy monitoring, it would be helpful to have an API endpoint providing a healthcheck for the cluster.

We should include:

  • status, enum? [healthy, degraded, unhealthy, dead]
  • a text message for human interpretation
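A hypothetical response shape, purely illustrative (this is not the implemented API):

GET /health
{
  "status": "degraded",
  "message": "leader healthy; 1 of 3 replicas lagging by more than 10MB"
}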

Multiple Patronis managing a single PostgreSQL instance

We should create some kind of lock file in the PostgreSQL data directory indicating that the instance is managed by a Patroni of a given cluster, in order to prevent multiple Patroni daemons from attaching to a single data directory. Right now this is possible: if you run a second patroni with the same data_dir but point it to another cluster, both will compete to set the instance into the desired state (master, replica) and will likely render the instance unusable.

Exception when changing replication slots

Hi,

Just trying out patroni for the first time (after giving up on governor). I seem to have replication working correctly, but on my secondaries I see the following error:

2015-12-10 21:04:19,823 ERROR: Exception when changing replication slots

Log snippet with stack trace: http://pastie.org/private/aup2rpvjfxsmgsjiwv9noa

Is that something to be worried about? Changes done on master are visible on the secondaries.

-Sajal

suppress error output on master check

Doing a failover test, I got this output from the replica:

2015-10-28 11:49:41,781 ERROR: request failed: GET http://127.0.0.1:8008/patroni
Traceback (most recent call last):
  File "/home/josh/patroni/patroni/ha.py", line 147, in fetch_node_status
    response = requests.get(member.api_url, timeout=2, verify=False)
  File "/home/josh/patronienv/local/lib/python2.7/site-packages/requests/api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "/home/josh/patronienv/local/lib/python2.7/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/home/josh/patronienv/local/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/josh/patronienv/local/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/home/josh/patronienv/local/lib/python2.7/site-packages/requests/adapters.py", line 423, in send
    raise ConnectionError(e, request=request)
ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=8008): Max retries exceeded with url: /patroni (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff78ea0efd0>: Failed to establish a new connection: [Errno 111] Connection refused',))

What's happening there is that the replica is failing to connect to the master because the master is down. Which is as it should be; so why are we getting an error with a trace dump?
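A minimal sketch of the fix this asks for, catching the expected failure in fetch_node_status and logging a single line (requests.exceptions.ConnectionError is the real exception class; the surrounding names follow the trace above):

import logging
import requests

logger = logging.getLogger(__name__)

def fetch_node_status(member):
    # Sketch: an unreachable member is expected during a failover,
    # so log one warning instead of dumping a traceback.
    try:
        return requests.get(member.api_url, timeout=2, verify=False)
    except requests.exceptions.ConnectionError:
        logger.warning('member %s is unreachable', member.api_url)
        return None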

The great connection cleanup

I'd like to rationalize how we track, store, and use postgres database connections. Currently the methods of passing connection information are quite haphazard, resulting in strange and unpredictable issues with hba and passwords. It also makes it really hard to follow the code, or to make simple fixes like issue #61.

I need to get agreement from all contributors on the way to address this, because otherwise we're headed for merge conflict hell, since touching connections and connection-passing will touch quite a bit of code.

Here is my thinking on how we should do this:

  • all connection information should be stored and passed as postgresql URIs.
  • we should only be using two types of connections to databases, which should have sensible defaults.

The two types of connections are:

  1. connections from the patroni daemon to the local PostgreSQL, which should default to a local socket connection as the supplied superuser.
  2. connections from one patroni node to another as the replication user, whose connection information is supplied by the member key in the DCS.

All other connections should go through the API; there is no reason why a patroni node should need to connect directly to another node for any purpose other than replication (and replication support).

This would also have some other, follow-on changes:

  • the URIs we are using will need to be fixed to be valid PostgreSQL URIs. Particularly, the "database" key needs to go (the proper key is "dbname").
  • the replication user will need perms for setting up replication slots, etc.

So ... feedback/changes/etc.?

Automatically prefix localhost connections

So, not making much progress. When trying to start a master node I get this:

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

/usr/lib/postgresql/9.4/bin/postgres -D /pgdata/datacd p

or
/usr/lib/postgresql/9.4/bin/pg_ctl -D /pgdata/datacd p -l logfile start

waiting for server to start....LOG: database system was shut down at 2016-01-08 01:15:51 UTC
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
done
server started
2016-01-08 01:15:52,558 INFO: established a new patroni connection to the postgres cluster
2016-01-08 01:15:53,411 INFO: established a new patroni connection to the postgres cluster
2016-01-08 01:15:54,164 INFO: established a new patroni connection to the postgres cluster
2016-01-08 01:15:55,267 INFO: established a new patroni connection to the postgres cluster
2016-01-08 01:15:56,309 INFO: established a new patroni connection to the postgres cluster
2016-01-08 01:15:56,311 INFO: removing initialize key after failed attempt to initialize the cluster
2016-01-08 01:15:56,320 INFO: renaming data directory to /pgdata/datacd p_2016-01-08-01-15-56
2016-01-08 01:15:56,321 ERROR: Error communicating with Postgresql. Will try again later
Traceback (most recent call last):

... and things fail from there. The thing is, PostgreSQL is running, as subsequent attempts to start it fail due to port 5432 already being bound.

Move AWS-specific code to callbacks

Patroni should be compatible with traditional datacenters, as well as with most cloud providers. Therefore, we don't want AWS-specific stuff (like backups to S3) to be part of it. Instead, it should be possible to call external programs via callbacks at various moments, i.e. at server start/stop, restart/reload, promote/demote.

Failed To Acquire Initialize Lock is still a problem

Folks,

I'm doing a lot of testing which involves masters that fail to start. Thing is, if this happens, the initialize lock is left behind, and the only way to fix it is to manually delete the lock using curl.

There still needs to be a way that the initialize lock can be expired if there are no nodes in the cluster.

Add HAProxy autoconfiguration

In a multi-replica environment, HAProxy needs to be configured and updated with the correct current master and replicas. Add a method to update haproxy.cfg with this information.

How I imagine this working is a separate, smaller daemon which monitors for changes of master and updates haproxy.cfg on the haproxy server(s).

we should enable data page checksums

If patroni is initdb'ing, we should enable data page checksums because:

  • it's a good idea
  • performance impact is not noticeable under most circumstances
  • prior to 9.5, it's required for pg_rewind to be safe to use
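In today's Patroni the equivalent lives in the bootstrap section, which passes options through to initdb; a sketch with key names per the current documentation:

bootstrap:
  initdb:
    - data-checksums  # becomes initdb --data-checksums
    - encoding: UTF8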

initialization fails if superuser.name=postgres

With HEAD as of a couple days ago, if you set the following:

superuser:
  username: postgres
  password: zalando

For a new database system, you will get this error on initialization:

2015-10-29 19:40:06,302 INFO: established a new patroni connection to the postgres cluster
ERROR:  role "postgres" already exists
STATEMENT:  CREATE ROLE "postgres" WITH LOGIN SUPERUSER PASSWORD 'super'
2015-10-29 19:40:06,316 INFO: removing initialize key after failed attempt to initialize the cluster
2015-10-29 19:40:06,317 INFO: closed patroni connection to the postgresql cluster
waiting for server to shut down...LOG:  received immediate shutdown request
.WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
 done
server stopped
2015-10-29 19:40:07,349 INFO: renaming data directory to /pgdata/data_2015-10-29-19-40-07
2015-10-29 19:40:07,350 ERROR: Error communicating with Postgresql. Will try again later
Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/patroni/ha.py", line 380, in _run_cycle
    return self.bootstrap()  # new node
  File "build/bdist.linux-x86_64/egg/patroni/ha.py", line 79, in bootstrap
    self.state_handler.bootstrap()
  File "build/bdist.linux-x86_64/egg/patroni/postgresql.py", line 647, in bootstrap
    self.create_connection_users()
  File "build/bdist.linux-x86_64/egg/patroni/postgresql.py", line 588, in create_connection_users
    self.superuser['username']), self.superuser['password'])
  File "build/bdist.linux-x86_64/egg/patroni/postgresql.py", line 160, in query
    return self.retry(self._query, sql, *params)
  File "build/bdist.linux-x86_64/egg/patroni/utils.py", line 153, in __call__
    return func(*args, **kwargs)
  File "build/bdist.linux-x86_64/egg/patroni/postgresql.py", line 153, in _query
    raise e
ProgrammingError: role "postgres" already exists

Then initialization will fail ... and be reattempted, in a continuous loop.

For all of the roles, we should check if they already exist, and only create them if they don't (but set passwords if they do).

Will work on a patch this week.
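A hedged sketch of that create-or-alter logic (helper name hypothetical; a psycopg2 cursor is assumed):

def create_or_update_role(cursor, name, password):
    # Sketch: create the role only if it is missing; otherwise just
    # reset its password. Role names cannot be bound as parameters,
    # hence the quoted-identifier formatting.
    cursor.execute('SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = %s', (name,))
    if cursor.fetchone() is None:
        cursor.execute('CREATE ROLE "{0}" WITH LOGIN SUPERUSER PASSWORD %s'.format(name),
                       (password,))
    else:
        cursor.execute('ALTER ROLE "{0}" WITH PASSWORD %s'.format(name), (password,))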

Implement a delayed replica

Currently, all replicas in the Spilo cluster are equal and can take over the role of the master when necessary. However, it is impossible to designate a replica as delayed by an hour or two, which in practice is very useful for mitigating disasters like dropping a very important table or running a DELETE without a WHERE clause.

It would be an improvement to have a replica running in a delayed mode with a configurable delay. Obviously, such a replica won't be able to participate in the failover process, but that's OK for the use case we are addressing.

We might think about implementing this after having the ability to mark a single node as non-participating in the election. Alternatively, using the external master feature, we may just run the whole spilo cluster in standby mode, and add another parameter to customize such a mode: that might be either a replication delay for delayed replicas, or max_standby_archive_delay/max_standby_streaming_delay to run analytical replicas without affecting the availability of the normal ones.
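Combining pieces discussed above, a delayed replica could be sketched as member configuration like the following (recovery_min_apply_delay is the real PostgreSQL 9.4+ standby setting; the tag usage assumes the nofailover tag from the tags issue above):

tags:
  nofailover: true  # a delayed replica must never be promoted
postgresql:
  parameters:
    recovery_min_apply_delay: 1h  # apply WAL an hour behind the master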

Alternative to backup-config files

I wanted to suggest an alternative to the current backup-config-files functionality on that branch.

We could support a conf.d directory, which would be outside PGDATA, and thus not wiped out by restores.

I believe this is a much better route than backing up and restoring conf files:

  1. we don't have to touch those files, so they could be maintained by a CMF
  2. doesn't involve a bunch of copying files
  3. makes it easier to extend functionality by simply dropping files in the conf.d directory

... in fact, I plan to implement conf.d support anyway, so why not just go that way?

If there's some reason why you also need pg.conf backup, then can we please make it an optional feature? For container-based deployments, it makes no sense to be "backing up" pg.conf etc.

Implement patronictl detach

Currently, on exit Patroni tries to shut down the PostgreSQL database it manages. This makes sense generally (otherwise we won't be able to guarantee that the leftover database will keep its role), but on some occasions (i.e. upgrading Patroni itself) we might want to detach Patroni from the PostgreSQL cluster and keep the latter running.

Shutting down PostgreSQL may wait indefinitely

It has happened that PostgreSQL does not shut down when signaled. Patroni then waits indefinitely for postgres to shut down.

It may be an OK idea to do a shutdown -m immediate after n minutes if postgres doesn't shut down.
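A minimal sketch of that escalation (timeout value and helper name hypothetical; the pg_ctl flags are real):

import subprocess

def stop_postgres(pg_ctl, data_dir, timeout=300):
    # Sketch: attempt a fast shutdown with a bounded wait, then
    # escalate to an immediate shutdown if postgres is still up.
    fast = [pg_ctl, 'stop', '-D', data_dir, '-m', 'fast', '-t', str(timeout), '-w']
    if subprocess.call(fast) != 0:
        subprocess.call([pg_ctl, 'stop', '-D', data_dir, '-m', 'immediate', '-w'])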

Security for API

Currently the API is completely unsecured. This keeps us from adding new features to the API which could actually destroy or reveal data, and isn't a great idea in general. As such, I'd like two things for the API:

  • simple user/password auth
  • SSL support

There are two ways we could approach this: one would be to add these things to the existing api.py using various snippets (for example: https://gist.github.com/fxsjy/5465353). The second would be to offer sample configurations for an nginx proxy in front of the API.

Thoughts?

Add sample init scripts

This is a TODO item for me.

Add sample init/upstart/systemd scripts for the most popular platforms.

Remove connections to all other replicas when deciding failover?

Currently, is_healthiest_node() involves each replica connecting to every other replica to compare xlog positions and find the "furthest ahead". This is fine if you have, say, two replicas, but if you have 20 it's liable to result in a really long failover time.

My proposal for changing this is as follows:

  1. each replica should publish its current position, timestamp, and apply lag to the state store.
  2. in is_healthiest_node(), replicas will FIRST compare themselves against the positions in the state store
  3. if the configuration has check_all_replicas, then it will connect to the other replicas in highest-xlog order. Otherwise it will proceed based on the state-store information.

If we make sure that the publication interval for replicas is more frequent than the master key TTL, then the information about replay location should be fairly good, and the connections should be unnecessary.

This should greatly reduce the number of connections, and lag, in the event of a failure event.

Drawbacks: this will mean skipping the check for an existing master in the cluster if check_all is disabled.
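A sketch of step 2, comparing the local position against what other members published in the state store (xlog_location mirrors the member key shown in the net-split issue above; function and attribute names are hypothetical):

def probably_healthiest(my_xlog_location, members):
    # Sketch: decide from published positions first; direct connections
    # (step 3) happen only when check_all_replicas is configured.
    published = [m.xlog_location for m in members if m.xlog_location is not None]
    return all(my_xlog_location >= pos for pos in published)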

error if user comments out postgresql/parameters section

Two places in postgresql.py, it does this:

    for setting, value in self.config['parameters'].items():

There are two problems with this code:

  1. if the user comments out the parameters section because there are no custom parameters, patroni errors out
  2. we shouldn't be reaching into config inside a function; this should be assigned to self.server_parameters in the __init__ function.

I'll fix this later if someone else doesn't do it first.
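Both problems have a small fix; a sketch (following the self.server_parameters suggestion in item 2, which is hypothetical here):

# In __init__, tolerate a missing or commented-out parameters section:
self.server_parameters = self.config.get('parameters') or {}

# Functions then iterate the attribute instead of reaching into config:
for setting, value in self.server_parameters.items():
    ...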

Errors out when attempting to start first replica with current HEAD

I'm trying to test the code which merges in the basebackup back to postgresql.py. However, when I try to start the first replica, I get this error:

2015-10-22 18:49:34,852 INFO: Starting new HTTP connection (1): 172.17.0.5
pg_ctl: directory "data/postgresql2" is not a database cluster directory
Traceback (most recent call last):
  File "./patroni.py", line 6, in <module>
    main()
  File "/home/josh/patroni/patroni/__init__.py", line 69, in main
    patroni.run()
  File "/home/josh/patroni/patroni/__init__.py", line 50, in run
    logger.info(self.ha.run_cycle())
  File "/home/josh/patroni/patroni/ha.py", line 414, in run_cycle
    return self._run_cycle()
  File "/home/josh/patroni/patroni/ha.py", line 364, in _run_cycle
    self.touch_member()
  File "/home/josh/patroni/patroni/ha.py", line 53, in touch_member
    'state': self.state_handler.state,
AttributeError: Postgresql instance has no attribute 'state'

It seems that it's not correctly detecting that there's an empty data dir, and I can't quite figure out why. This wasn't happening before I merged the most recent HEAD.

why are we parsing URIs?

The connection strings to members of the cluster are in the form of URIs. For PostgreSQL 9.2 and later, URIs are supported by psycopg2 as connection strings, and are supported by various PostgreSQL utilities. Yet for some reason, the Patroni code spends a lot of time parsing and recomposing URIs instead of just using them. Why?

This relates to issue #61.

Run pg_rewind for the former master that was shut down uncleanly (i.e. killed -9)

Currently, pg_rewind can only process a former master that was shut down cleanly. This is useful for the case of split brain/connection loss between the former master and the DCS, but in real life some of the shutdowns will happen because of a PostgreSQL crash, due to bad hardware or bugs.

What we could do is start the crashed node in single-user mode and issue a checkpoint before trying to run pg_rewind on it. We have to check, though, that no data loss occurs as a result.
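A hedged sketch of that sequence (paths and connection parameters illustrative; postgres --single and the pg_rewind flags are real):

# Issue a checkpoint in single-user mode so pg_rewind starts from a
# consistent state:
$ echo "CHECKPOINT;" | /usr/lib/postgresql/9.4/bin/postgres --single -D /pgdata/data postgres

# Then rewind against the new master:
$ pg_rewind -D /pgdata/data --source-server="host=172.17.0.98 user=postgres"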

Consider using better health checking on Zookeeper and Consul.

Zookeeper has ephemeral nodes, and Consul has health checks. Either of these mechanisms provides a means to detect master failure -- and even replica failure -- faster than via the polling method we need to use with etcd.

We should consider using this advanced mechanism on each of these SIS platforms.

Add new commands to API

We need the following commands in the API to support manual failover and other reasonable operations:

  • shutdown
  • promote
  • restart
  • reinitialize

This will require adding new paths to the API. Given that, do we want to stick with SimpleHTTPServer?
