
spilo's Introduction

Spilo: HA PostgreSQL Clusters with Docker

Spilo is a Docker image that provides PostgreSQL and Patroni bundled together. Patroni is a template for PostgreSQL HA. Multiple Spilos can form a resilient, highly available PostgreSQL cluster. For this, you'll need to start all participating Spilos with identical etcd addresses and cluster names.

Spilo's name derives from სპილო [spiːlɒ], the Georgian word for "elephant."

Real-World Usage and Plans

Spilo is currently evolving: Its creators are working on a Postgres operator that would make it simpler to deploy scalable Postgres clusters in a Kubernetes environment, and that would also handle maintenance tasks. Spilo would serve as an essential building block for this. There is already a Helm chart that relies on Spilo and Patroni to provision a five-node PostgreSQL HA cluster in a Kubernetes + Google Compute Engine environment. (The Helm chart deploys Spilo Docker images, not just "bare" Patroni.)

How to Use This Docker Image

Spilo's setup assumes that you've correctly configured a load balancer (HAProxy, ELB, Google load balancer) that directs client connections to the master. There are two ways to achieve this: A) if the load balancer relies on the status code to distinguish between the healthy and failed nodes (like ELB), then one needs to configure it to poll the API URL; otherwise, B) you can use callback scripts to change the load balancer configuration dynamically.
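For option (A), the health check polls Patroni's REST API (port 8008 by default), whose /master endpoint answers HTTP 200 on the current master and a non-200 status (e.g. 503) on replicas. A minimal sketch of that check in Python (the helper name is ours, not part of Spilo or Patroni):

```python
from http.client import HTTPConnection

def is_master(host, port=8008, timeout=2):
    """Return True if the node's Patroni REST API reports it as master.

    Patroni answers GET /master with HTTP 200 on the current master
    and a non-200 status on replicas.
    """
    try:
        conn = HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", "/master")
        return conn.getresponse().status == 200
    except OSError:
        # Unreachable or refusing nodes count as "not master" for routing.
        return False
```

A load balancer configured this way (as ELB can be) routes traffic only to the node whose check succeeds.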

Available container registry and image architectures

Spilo images are made available in the GitHub container registry (ghcr.io). Images are built and published for linux/amd64 and linux/arm64 on each tag. For PostgreSQL version 14, currently available images can be found here: https://github.com/zalando/spilo/pkgs/container/spilo-14

How to Build This Docker Image

$ cd postgres-appliance

$ docker build --tag $YOUR_TAG .

There are a few build arguments defined in the Dockerfile; you can change them by specifying --build-arg options:

  • WITH_PERL=false # set to true if you want to install perl and plperl packages into image
  • PGVERSION="12"
  • PGOLDVERSIONS="9.5 9.6 10 11"
  • DEMO=false # set to true to build the smallest possible image which will work only on Kubernetes
  • TIMESCALEDB_APACHE_ONLY=true # set to false to build timescaledb community version (Timescale License)
  • TIMESCALEDB_TOOLKIT=true # set to false to skip installing toolkit with timescaledb community edition. Only relevant when TIMESCALEDB_APACHE_ONLY=false
  • ADDITIONAL_LOCALES= # additional UTF-8 locales to build into image (example: "de_DE pl_PL fr_FR")

Run the image locally after build:

$ docker run -it your-spilo-image:$YOUR_TAG

Have a look inside the container:

$ docker exec -it $CONTAINER_NAME bash

Connecting to PostgreSQL

Administrative Connections

PostgreSQL is configured by default to listen to port 5432. Spilo master initializes PostgreSQL and creates the superuser and replication user (postgres and standby by default).

You'll need to set up Spilo to create a database and roles for your application(s). For example:

psql -h myfirstspilo.example.com -p 5432 -U admin -d postgres

Application Connections

Once you have created a database and roles for your application, you can connect to Spilo just as you would to any other PostgreSQL cluster:

psql -h myfirstspilo.example.com -p 5432 -U wow_app -d wow
psql -d "postgresql://myfirstspilo.example.com:5432/wow?user=wow_app"

Configuration

Spilo is configured via environment variables. Their values are either supplied directly via the environment (when Spilo is launched as a set of Docker containers) or set in a configuration file or manifest (when Spilo runs in an orchestration environment such as Kubernetes or Docker Compose).

Please go here to see our list of environment variables.

To supply env variables manually via the environment for local testing:

docker run -it -e YOUR_ENV_VAR=test your-spilo-image:latest
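Inside the container, configure_spilo.py turns such environment variables into Patroni configuration. Conceptually, the collection step overlays the environment onto built-in defaults; a simplified sketch (function name and the default set shown are ours, purely illustrative):

```python
def collect_config(environ, defaults):
    """Overlay environment variables onto built-in defaults.

    Only keys that exist in `defaults` are considered, so unrelated
    environment variables are ignored.
    """
    return {key: environ.get(key, default) for key, default in defaults.items()}

# Illustrative defaults; real Spilo recognizes many more variables.
defaults = {"SCOPE": "demo", "ETCD_HOST": ""}
config = collect_config({"SCOPE": "mycluster", "PATH": "/usr/bin"}, defaults)
# config == {"SCOPE": "mycluster", "ETCD_HOST": ""}
```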

Issues and Contributing

Spilo welcomes questions via our issues tracker. We also greatly appreciate fixes, feature requests, and updates; before submitting a pull request, please visit our contributor guidelines.

License

This project uses the Apache 2.0 license.


spilo's Issues

ISO C90 forbids variable length array 'ct' [-Wvla] issue

Hi, I tried to run docker build of spilo/postgres-appliance/Dockerfile and got the error below.

pam_oauth2.c: In function 'pam_sm_authenticate':
pam_oauth2.c:195:12: warning: ISO C90 forbids variable length array 'ct' [-Wvla]
struct check_tokens ct[argc];

My system is CentOS 7.
Any suggestion would be appreciated.

Helm chart: cannot login with admin credentials

It seems that there isn't a possible way to log into postgres with the "admin" user defined in the Kubernetes helm chart here.

By default, here are the users once I log in with -U postgres (password tea).

postgres=# \du
                                           List of roles
 Role name  |                         Attributes                         |        Member of
------------+------------------------------------------------------------+-------------------------
 admin      | Create role, Create DB, Cannot login                       | {}
 pgq_admin  | Cannot login                                               | {pgq_reader,pgq_writer}
 pgq_reader | Cannot login                                               | {}
 pgq_writer | Cannot login                                               | {}
 postgres   | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
 robot_zmon | Cannot login                                               | {}
 standby    | Replication                                                | {}

It doesn't seem like there is an environment variable for PGPASSWORD_ADMIN in the docs either. If there's anywhere I can help with this issue, I'd gladly open a PR as well.

Sometimes pg_receivexlog --no-loop may fail when streaming from replica

pg_receivexlog: unexpected termination of replication stream: ERROR:  requested starting point 6008/9D000000 is ahead of the WAL flush position of this server 6008/9CE7E000

If we remove the --no-loop argument, it works fine:

pg_receivexlog: unexpected termination of replication stream: ERROR:  requested starting point 6008/AC000000 is ahead of the WAL flush position of this server 6008/AB544000
pg_receivexlog: disconnected; waiting 5 seconds to try again

patroni entered a FATAL state

The following is from a fresh install of Patroni using the Helm chart provided by incubator/patroni. All of the etcd and spilo nodes come up, but all of the spilo nodes show the following error.
The environment is the Canonical Distribution of Kubernetes 1.7 running on top of OpenStack.

If there is a better place to report this issue, please let me know and I will report elsewhere.

2017-08-29 00:17:19,541 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Local?)
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/connection.py", line 141, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.4/socket.py", line 533, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/connectionpool.py", line 356, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.4/http/client.py", line 1125, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.4/http/client.py", line 1163, in _send_request
    self.endheaders(body)
  File "/usr/lib/python3.4/http/client.py", line 1121, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python3.4/http/client.py", line 951, in _send_output
    self.send(msg)
  File "/usr/lib/python3.4/http/client.py", line 886, in send
    self.connect()
  File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/connection.py", line 166, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.HTTPConnection object at 0x7fd6b0237160>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/requests/adapters.py", line 423, in send
    timeout=timeout
  File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/connectionpool.py", line 649, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/util/retry.py", line 376, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='instance-data', port=80): Max retries exceeded with url: /latest/meta-data/placement/availability-zone (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd6b0237160>: Failed to establish a new connection: [Errno -2] Name or service not known',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/configure_spilo.py", line 526, in <module>
    main()
  File "/configure_spilo.py", line 475, in main
    placeholders = get_placeholders(provider)
  File "/configure_spilo.py", line 309, in get_placeholders
    placeholders['instance_data'] = get_instance_metadata(provider)
  File "/configure_spilo.py", line 256, in get_instance_metadata
    metadata[k] = requests.get('{}/{}'.format(url, v or k), timeout=2, headers=headers).text
  File "/usr/local/lib/python3.4/dist-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.4/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='instance-data', port=80): Max retries exceeded with url: /latest/meta-data/placement/availability-zone (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd6b0237160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
2017-08-29 00:17:20,387 CRIT Supervisor running as root (no user in config file)
2017-08-29 00:17:20,387 WARN Included extra file "/etc/supervisor/conf.d/patroni.conf" during parsing
2017-08-29 00:17:20,387 WARN Included extra file "/etc/supervisor/conf.d/cron.conf" during parsing
2017-08-29 00:17:20,388 WARN Included extra file "/etc/supervisor/conf.d/pgq.conf" during parsing
2017-08-29 00:17:20,416 INFO RPC interface 'supervisor' initialized
2017-08-29 00:17:20,416 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2017-08-29 00:17:20,417 INFO supervisord started with pid 1
2017-08-29 00:17:21,419 INFO spawned: 'cron' with pid 25
2017-08-29 00:17:21,421 INFO spawned: 'patroni' with pid 26
2017-08-29 00:17:21,423 INFO spawned: 'pgq' with pid 27
Usage: /usr/local/bin/patroni config.yml
	Patroni may also read the configuration from the PATRONI_CONFIGURATION environment variable
2017-08-29 00:17:21,773 INFO exited: patroni (exit status 1; not expected)
2017-08-29 00:17:22,774 INFO success: cron entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-08-29 00:17:22,775 INFO spawned: 'patroni' with pid 33
2017-08-29 00:17:22,776 INFO success: pgq entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Usage: /usr/local/bin/patroni config.yml
	Patroni may also read the configuration from the PATRONI_CONFIGURATION environment variable
2017-08-29 00:17:23,130 INFO exited: patroni (exit status 1; not expected)
2017-08-29 00:17:25,134 INFO spawned: 'patroni' with pid 35
Usage: /usr/local/bin/patroni config.yml
	Patroni may also read the configuration from the PATRONI_CONFIGURATION environment variable
2017-08-29 00:17:25,468 INFO exited: patroni (exit status 1; not expected)
2017-08-29 00:17:28,473 INFO spawned: 'patroni' with pid 37
Usage: /usr/local/bin/patroni config.yml
	Patroni may also read the configuration from the PATRONI_CONFIGURATION environment variable
2017-08-29 00:17:28,804 INFO exited: patroni (exit status 1; not expected)
2017-08-29 00:17:29,805 INFO gave up: patroni entered FATAL state, too many start retries too quickly
2017-08-29 00:21:32,284 INFO spawned: 'patroni' with pid 75
Usage: /usr/local/bin/patroni config.yml
	Patroni may also read the configuration from the PATRONI_CONFIGURATION environment variable
2017-08-29 00:21:32,634 INFO exited: patroni (exit status 1; not expected)
2017-08-29 00:21:33,637 INFO spawned: 'patroni' with pid 77
Usage: /usr/local/bin/patroni config.yml
	Patroni may also read the configuration from the PATRONI_CONFIGURATION environment variable
2017-08-29 00:21:34,010 INFO exited: patroni (exit status 1; not expected)
2017-08-29 00:21:36,014 INFO spawned: 'patroni' with pid 79
Usage: /usr/local/bin/patroni config.yml
	Patroni may also read the configuration from the PATRONI_CONFIGURATION environment variable
2017-08-29 00:21:36,350 INFO exited: patroni (exit status 1; not expected)
2017-08-29 00:21:39,355 INFO spawned: 'patroni' with pid 81
Usage: /usr/local/bin/patroni config.yml
	Patroni may also read the configuration from the PATRONI_CONFIGURATION environment variable
2017-08-29 00:21:39,698 INFO exited: patroni (exit status 1; not expected)
2017-08-29 00:21:40,699 INFO gave up: patroni entered FATAL state, too many start retries too quickly
2017-08-29 00:23:13,550 INFO spawned: 'patroni' with pid 92
Usage: /usr/local/bin/patroni config.yml
	Patroni may also read the configuration from the PATRONI_CONFIGURATION environment variable
2017-08-29 00:23:13,883 INFO exited: patroni (exit status 1; not expected)
2017-08-29 00:23:14,886 INFO spawned: 'patroni' with pid 94
Usage: /usr/local/bin/patroni config.yml
	Patroni may also read the configuration from the PATRONI_CONFIGURATION environment variable
2017-08-29 00:23:15,250 INFO exited: patroni (exit status 1; not expected)
2017-08-29 00:23:17,254 INFO spawned: 'patroni' with pid 96
Usage: /usr/local/bin/patroni config.yml
	Patroni may also read the configuration from the PATRONI_CONFIGURATION environment variable
2017-08-29 00:23:17,607 INFO exited: patroni (exit status 1; not expected)
2017-08-29 00:23:20,613 INFO spawned: 'patroni' with pid 98
Usage: /usr/local/bin/patroni config.yml
	Patroni may also read the configuration from the PATRONI_CONFIGURATION environment variable
2017-08-29 00:33:44,999 INFO exited: patroni (exit status 1; not expected)
2017-08-29 00:33:46,000 INFO gave up: patroni entered FATAL state, too many start retries too quickly

Readme rewrite draft: link

Hey @alexeyklyukin @CyberDem0n, I've begun creating a simple README for the next iteration of Spilo here.

It wasn't clear from the current Spilo README + Read the Docs that Spilo was a Docker image; sometimes it's important to state the obvious. Even Docker images on Docker Hub say "this Docker image." With this in mind, I've added the obvious. :)

I've also drawn upon the standard text of the official Postgres Docker image and other DB-related Docker images at the Docker Hub, to brainstorm section headings for you to consider adding to your README. Feel free to discard what's irrelevant. My advice is to just keep simple whatever is essential and not subject to change with Spilo Phase II.

Speaking of which: I realize that Spilo's undergoing transformation, and that the postgres-operator's in progress. You might drop some lines about the operator inside Spilo's readme, to share your project vision more openly. This might then inspire feedback from others in the community, who will be curious as well as insightful.

spilo/patroni not able to elect new leader if previous leader, last working member failed due to full disk?

Scenario

  • GKE Kubernetes
  • spilo Pods via StatefulSet: patroni-set-0003
    kind: StatefulSet
    # [...]
    metadata:
      name: patroni-set-0003
    spec:
      replicas: 3
      # [...]
      template:
        spec:
          containers:
            - name: spilo
              # [...]
              env:
                - name: SCOPE
                  value: the-scope
              volumeMounts:
                - mountPath: /home/postgres/pgdata
                  name: pgdata
      volumeClaimTemplates:
        - metadata:
            name: pgdata
            
          spec:
            # [...]
            resources.requests.storage: 500Gi

Unfortunately, /home/postgres/pgdata ran out of space (in all pods, it seems,
probably almost simultaneously) and spilo/patroni started logging:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/patroni/async_executor.py", line 39, in run
    wakeup = func(*args) if args else func()
  File "/usr/local/lib/python3.5/dist-packages/patroni/postgresql.py", line 1067, in _do_follow
    self.write_recovery_conf(primary_conninfo)
  File "/usr/local/lib/python3.5/dist-packages/patroni/postgresql.py", line 911, in write_recovery_conf
    f.write("{0} = '{1}'\n".format(name, value))
OSError: [Errno 28] No space left on device

I believe the last leader before all pods went out of disk was
either patroni-set-0003-1 or patroni-set-0003-2.

Recovery

In order to solve the issue I:

  1. Scaled down patroni-set-0003 to 1 replica (still failing with OSError: No
    space left on device). Note that this left me without any running old
    leader, broken or not; I believe this could be key to my issue.
  2. Created a new StatefulSet, patroni-set-0004, with the same configuration as
    patroni-set-0003 except
    spec.metadata.name: patroni-set-0004
    spec.replicas: 1
    spec.volumeClaimTemplates[0].spec.resources.requests: 1Ti

With only the broken patroni-set-0003-0 running, patroni-set-0004-0 started
restoring from the WAL archive; I left it overnight to restore. During this time
both patroni-set-0003-0 and patroni-set-0004-0 were running, but patroni-set-0003-0 was out of disk.

Several hours later, patroni-set-0004-0 was logging lots of:

following a different leader because i am not the healthiest node
Lock owner: None; I am patroni-set-0004-0
wal_e.blobstore.gs.utils WARNING MSG: could no longer locate object while performing wal restore
DETAIL: The absolute URI that could not be located is gs://the-bucket/spilo/the-scope/wal/wal_005/the-file.lzo.
HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
STRUCTURED: time=2017-06-26T12:05:23.646236-00 pid=207
lzop: <stdin>: not a lzop file
[...] 

I expected patroni-set-0004-0 to take over the master lock by this time.


While debugging why the disk outage occurred, I found out about ext reserved
blocks. I then recovered 25Gi of disk space on patroni-set-0003-0's pgdata by
running tune2fs -m 0 /dev/$PGDATA_DEV. I realize in hindsight that simply
resizing the GCE PD would have been easier.

However, once patroni-set-0003-0 was given extra space and restarted, it did
not seem willing to take the leader role even given the extra disk space and no
current leader, logging lots of:

Lock owner: None; I am patroni-set-0003-0
wal_e.blobstore.gs.utils WARNING MSG: could no longer locate object while performing wal restore
DETAIL: The absolute URI that could not be located is gs://the-bucket/spilo/the-scope/wal/wal_005/the-file.lzo.
HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
STRUCTURED: time=2017-06-26T12:05:23.646236-00 pid=207
lzop: <stdin>: not a lzop file
[...] 

I expected patroni-set-0003-0 to take the leader role by this time.


I then did the same thing to patroni-set-0003-{1,2}, freeing up 25Gi of space.

Once patroni-set-0003-1 was given extra disk space and restarted it took the
master lock.

Use ec2 instance ID as the name to identify a cluster

Currently the cluster member name is generated using the HOSTNAME (which is the hostname inside Docker). This does not help in identifying which member runs on which EC2 instance.

Proposal: Include the Instance ID in the name, postfixed with a short random string to enable multiple containers running on the same ec2 instance.
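A sketch of what such a naming scheme could look like (purely illustrative; not what Spilo currently does):

```python
import random
import string

def member_name(instance_id, suffix_len=4):
    """Build a cluster member name from the EC2 instance ID plus a short
    random suffix, so several containers on one instance stay distinct."""
    alphabet = string.ascii_lowercase + string.digits
    suffix = "".join(random.choices(alphabet, k=suffix_len))
    return f"{instance_id}-{suffix}"
```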

Unable to bootstrap when using Patroni 1.0

With 84cadfb Spilo is bumped to Patroni 1.0.

However, when bootstrapping a new cluster, it loops trying to bootstrap without a leader using wal-e:

2016-07-06 14:23:37,613 ERROR: could not query wal-e latest backup: Command '['envdir', '/home/postgres/etc/wal-e.d/env', 'wal-e', '--aws-instance-profile', 'backup-list', '--detail', 'LATEST']' returned non-zero exit status 1
2016-07-06 14:23:37,636 ERROR: failed to bootstrap (without leader)
2016-07-06 14:23:37,636 INFO: Removing data directory: /home/postgres/pgdata/data

When the create_replica_method is removed from the yaml altogether, it also loops waiting for the leader to bootstrap:

2016-07-06 14:26:55,435 INFO success: cron entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-07-06 14:26:55,435 INFO success: patroni entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-07-06 14:26:55,435 INFO success: pgq entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-07-06 14:27:04,431 INFO: waiting for leader to bootstrap
2016-07-06 14:27:14,431 INFO: waiting for leader to bootstrap
2016-07-06 14:27:24,431 INFO: waiting for leader to bootstrap
2016-07-06 14:27:34,432 INFO: waiting for leader to bootstrap

contents of etcd at time of looping:

{
    "action": "get",
    "node": {
        "createdIndex": 80081,
        "dir": true,
        "key": "/service/dummy",
        "modifiedIndex": 80081,
        "nodes": [
            {
                "createdIndex": 80081,
                "dir": true,
                "key": "/service/dummy/members",
                "modifiedIndex": 80081,
                "nodes": [
                    {
                        "createdIndex": 80114,
                        "expiration": "2016-07-06T14:29:24.431983611Z",
                        "key": "/service/dummy/members/42644697d238",
                        "modifiedIndex": 80114,
                        "ttl": 29,
                        "value": "{\"role\":\"uninitialized\",\"state\":\"stopped\",\"conn_url\":\"postgres://172.17.0.2:5432/postgres\",\"api_url\":\"http://172.17.0.2:8008/patroni\"}"
                    }
                ]
            }
        ]
    }
}

The problem is not with Patroni 1.0 per se: using the Docker image from the Patroni repo works as expected (members are bootstrapped).

Unable to use PostGIS

With registry.opensource.zalan.do/acid/spilo-9.6:1.3-p4:

postgres=# CREATE DATABASE cbandy;
CREATE DATABASE
postgres=# \c cbandy
You are now connected to database "cbandy" as user "postgres".
cbandy=# CREATE EXTENSION postgis;
ERROR:  could not open extension control file "/usr/share/postgresql/9.6/extension/postgis.control": No such file or directory
# dpkg -l '*postgis*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                Version                   Architecture 
+++-===================================-=========================-=============
un  postgis                             <none>                    <none>       
ii  postgresql-9.3-postgis-2.3          2.3.3+dfsg-1.pgdg16.04+1  amd64        
un  postgresql-9.3-postgis-2.3-scripts  <none>                    <none>       
ii  postgresql-9.4-postgis-2.3          2.3.3+dfsg-1.pgdg16.04+1  amd64        
un  postgresql-9.4-postgis-2.3-scripts  <none>                    <none>       
ii  postgresql-9.5-postgis-2.3          2.3.3+dfsg-1.pgdg16.04+1  amd64        
un  postgresql-9.5-postgis-2.3-scripts  <none>                    <none>       
ii  postgresql-9.6-postgis-2.3          2.3.3+dfsg-1.pgdg16.04+1  amd64        
un  postgresql-9.6-postgis-2.3-scripts  <none>                    <none>       

If I understand correctly, the postgresql-9.6-postgis-2.3-scripts package is required to utilize the PostGIS extension.

pg_hba.conf does not allow connections from the load balancer

When trying to connect to a newly created cluster, my access is not allowed:

$ ssh -f -N -L 7543:lila.acid.example.com:5432 [email protected]
$ psql -h localhost -p 7543 -U standby -d postgres
psql: FATAL:  no pg_hba.conf entry for host "172.31.3.222", user "standby", database "postgres", SSL on
FATAL:  no pg_hba.conf entry for host "172.31.18.173", user "standby", database "postgres", SSL off
$ dig +noall +answer lila.acid.example.com
;; Warning: Message parser reports malformed message packet.
lila.acid.example.com.  3600    IN  CNAME   internal-spilo-lila-2078097691.eu-west-1.elb.amazonaws.com.
internal-spilo-lila-2078097691.eu-west-1.elb.amazonaws.com. 3600 IN A 172.31.15.137
internal-spilo-lila-2078097691.eu-west-1.elb.amazonaws.com. 3600 IN A 172.31.18.173
internal-spilo-lila-2078097691.eu-west-1.elb.amazonaws.com. 3600 IN A 172.31.3.222

Postgres Template is broken

Do you want a replica ELB? [y/N]: N
Traceback (most recent call last):
  File "/usr/local/bin/senza", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/site-packages/senza/cli.py", line 1383, in main
    handle_exceptions(cli)()
  File "/usr/local/lib/python3.5/site-packages/senza/cli.py", line 250, in wrapper
    func()
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/senza/cli.py", line 808, in init
    variables = module.gather_user_variables(variables, region, account_info)
  File "/usr/local/lib/python3.5/site-packages/senza/templates/postgresapp.py", line 396, in gather_user_variables
    default=get_vpc_attribute(account_info.VpcID, 'cidr_block'))
  File "/usr/local/lib/python3.5/site-packages/senza/aws.py", line 35, in get_vpc_attribute
    ec2 = boto3.resource('ec2')
  File "/usr/local/lib/python3.5/site-packages/boto3/__init__.py", line 87, in resource
    return _get_default_session().resource(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/boto3/session.py", line 292, in resource
    aws_session_token=aws_session_token, config=config)
  File "/usr/local/lib/python3.5/site-packages/boto3/session.py", line 200, in client
    aws_session_token=aws_session_token, config=config)
  File "/usr/local/lib/python3.5/site-packages/botocore/session.py", line 796, in create_client
    client_config=config, api_version=api_version)
  File "/usr/local/lib/python3.5/site-packages/botocore/client.py", line 62, in create_client
    verify, credentials, scoped_config, client_config)
  File "/usr/local/lib/python3.5/site-packages/botocore/client.py", line 227, in _get_client_args
    endpoint_url)
  File "/usr/local/lib/python3.5/site-packages/botocore/client.py", line 117, in _get_signature_version_and_region
    service_model.endpoint_prefix, region_name, scheme=scheme)
  File "/usr/local/lib/python3.5/site-packages/botocore/regions.py", line 70, in construct_endpoint
    raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.

Disable Etcd

Hi, first of all, thanks for your work; it's truly amazing.

I was trying to disable etcd (to save some resources) since I set the variable ETCD_HOST, but I could not find any option to do such a thing.
I read configure_spilo.py and saw that, by creating an empty etcd.conf file in supervisord's conf.d, the file should not be overwritten unless the option --force is specified.
I found out it was overwriting the file anyway, even if it existed. I looked at the write_file function, and apparently, after emitting the warning, it does not skip the writing part:

def write_file(config, filename, overwrite):
    if not overwrite and os.path.exists(filename):
        logging.warning('File {} already exists, not overwriting. (Use option --force if necessary)'.format(filename))
    with open(filename, 'w') as f:
        logging.info('Writing to file {}'.format(filename))
        f.write(config)

So in the log I have

pg1_1  | 2017-09-27 09:54:44,887 - bootstrapping - WARNING - File /etc/supervisor/conf.d/etcd.conf already exists, not overwriting. (Use option --force if necessary)
pg1_1  | 2017-09-27 09:54:44,888 - bootstrapping - INFO - Writing to file /etc/supervisor/conf.d/etcd.conf

The version of Spilo I've been using is the spilo-9.6:1.2-p27.
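For reference, the skip that the warning implies would just be an early return after logging; a sketch of the fix (our guess at the intent, not the project's actual code):

```python
import logging
import os

def write_file(config, filename, overwrite):
    if not overwrite and os.path.exists(filename):
        logging.warning('File %s already exists, not overwriting. '
                        '(Use option --force if necessary)', filename)
        return  # the missing early exit: leave the existing file untouched
    with open(filename, 'w') as f:
        logging.info('Writing to file %s', filename)
        f.write(config)
```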

AWS credentials

I see there's a GOOGLE_APPLICATION_CREDENTIALS environment variable. Is there a similar one for AWS? Or how does the S3 bucket authentication work?

Large number of "reaped unknown pid " messages during restore

I'm seeing a lot of messages of this kind in the output from my spilo container when creating a replica from a WAL-E basebackup (that's what I think I'm doing, at least; I'm trying to make creating a new replica bother the leader as little as possible).

2017-04-27 13:41:35,442 INFO reaped unknown pid 27920

Cluster

+-----------+--------------------+--------------+--------+---------+-----------+
| Cluster   | Member             | Host         |  Role  |  State  | Lag in MB |
+-----------+--------------------+--------------+--------+---------+-----------+
| patroni-1 | patroni-set-0004-0 | 10.68.24.78  | Leader | running |         0 |
| patroni-1 | patroni-set-0004-1 | 10.68.23.108 |        | running |         0 |
| patroni-1 | patroni-set-0008-0 | 10.68.22.205 |        | running |     48525 | [1]
+-----------+--------------------+--------------+--------+---------+-----------+

[1]: I have relaxed the WALE_BACKUP_THRESHOLD_* settings considerably in order to be able to restore with WAL-E while the leader is running

Config

postgres.yml

bootstrap:
  # [...]
etcd:
  host: etcd-cluster
postgresql:
  # [...]
  basebackup_fast_xlog:
    command: /basebackup.sh
    retries: 2
  callbacks:
    on_restart: /callback_role.py
    on_role_change: /callback_role.py
    on_start: /callback_role.py
    on_stop: /callback_role.py
  connect_address: 10.68.22.205:5432
  create_replica_method:
  - wal_e
  - basebackup_fast_xlog
  data_dir: /home/postgres/pgdata/pgroot/data
  listen: 0.0.0.0:5432
  name: patroni-set-0008-0
  parameters:
    # [...]
  recovery_conf:
    restore_command: envdir "/home/postgres/etc/wal-e.d/env" /wale_restore_command.sh "%f" "%p"
  scope: patroni-1
  wal_e:
    command: patroni_wale_restore
    envdir: /home/postgres/etc/wal-e.d/env
    no_master: 1
    retries: 2
    threshold_backup_size_percentage: 50
    threshold_megabytes: 102400
    use_iam: 1
restapi:
  connect_address: 10.68.22.205:8008
  listen: 0.0.0.0:8008
scope: patroni-1

Logs

stderr/out

[...]
2017-04-27 13:41:31,376 INFO: Lock owner: patroni-set-0004-0; I am patroni-set-0008-0
2017-04-27 13:41:31,376 INFO: does not have lock
2017-04-27 13:41:31,386 INFO: no action.  i am a secondary and i am following a leader
2017-04-27 13:41:32,104 INFO reaped unknown pid 27856
2017-04-27 13:41:33,784 INFO reaped unknown pid 27917
2017-04-27 13:41:33,793 INFO reaped unknown pid 28028
2017-04-27 13:41:34,194 INFO reaped unknown pid 27922
2017-04-27 13:41:34,610 INFO reaped unknown pid 27921
2017-04-27 13:41:35,442 INFO reaped unknown pid 27920
[...]

pgdata/pgroot/pg_log/postgresql-4.log

[...]
wal_e.blobstore.gs.utils INFO     MSG: completed download and decompression
        DETAIL: Downloaded and decompressed "gs://my-postgres-backup/spilo/patroni-1/wal/wal_005/0000000A0000013F0000000E.lzo" to "pg_xlog/RECOVERYXLOG"
        STRUCTURED: time=2017-04-27T13:52:01.185621-00 pid=5857
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2017-04-27T13:52:01.186139-00 pid=5857 action=wal-fetch key=gs://my-postgres-backup/spilo/patroni-1/wal/wal_005/0000000A0000013F0000000E.lzo prefix=spilo/patroni-1/wal/ seg=0000000A0000013F0000000E state=complete
wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
        STRUCTURED: time=2017-04-27T13:52:03.330436-00 pid=5927 action=wal-fetch key=gs://my-postgres-backup/spilo/patroni-1/wal/wal_005/0000000A0000013F00000012.lzo prefix=spilo/patroni-1/wal/ seg=0000000A0000013F00000012
wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
        STRUCTURED: time=2017-04-27T13:52:05.169399-00 pid=5963 action=wal-fetch key=gs://my-postgres-backup/spilo/patroni-1/wal/wal_005/0000000A0000013F00000013.lzo prefix=spilo/patroni-1/wal/ seg=0000000A0000013F00000013
wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
        STRUCTURED: time=2017-04-27T13:52:06.791310-00 pid=5999 action=wal-fetch key=gs://my-postgres-backup/spilo/patroni-1/wal/wal_005/0000000A0000013F00000015.lzo prefix=spilo/patroni-1/wal/ seg=0000000A0000013F00000015
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2017-04-27T13:52:08.914942-00 pid=6017 action=wal-fetch key=gs://my-postgres-backup/spilo/patroni-1/wal/wal_005/0000000A0000013F00000016.lzo prefix=spilo/patroni-1/wal/ seg=0000000A0000013F00000016 state=begin
wal_e.blobstore.gs.utils INFO     MSG: completed download and decompression
        DETAIL: Downloaded and decompressed "gs://my-postgres-backup/spilo/patroni-1/wal/wal_005/0000000A0000013F00000016.lzo" to "pg_xlog/RECOVERYXLOG"
        STRUCTURED: time=2017-04-27T13:52:10.087792-00 pid=6017
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2017-04-27T13:52:10.088184-00 pid=6017 action=wal-fetch key=gs://my-postgres-backup/spilo/patroni-1/wal/wal_005/0000000A0000013F00000016.lzo prefix=spilo/patroni-1/wal/ seg=0000000A0000013F00000016 state=complete
wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
        STRUCTURED: time=2017-04-27T13:52:11.905947-00 pid=6057 action=wal-fetch key=gs://my-postgres-backup/spilo/patroni-1/wal/wal_005/0000000A0000013F00000017.lzo prefix=spilo/patroni-1/wal/ seg=0000000A0000013F00000017
wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
        STRUCTURED: time=2017-04-27T13:52:13.466800-00 pid=6122 action=wal-fetch key=gs://my-postgres-backup/spilo/patroni-1/wal/wal_005/0000000A0000013F0000001C.lzo prefix=spilo/patroni-1/wal/ seg=0000000A0000013F0000001C
[...]

pgdata/pgroot/pg_log/postgresql-4.csv

[...]
2017-04-27 13:54:01.489 UTC,,,2030,,5901eeb0.7ee,1941,,2017-04-27 13:14:24 UTC,1/0,0,LOG,00000,"restored log file ""0000000A0000013F0000007D"" from archive",,,,,,,,,""
2017-04-27 13:54:01.605 UTC,,,2030,,5901eeb0.7ee,1942,,2017-04-27 13:14:24 UTC,1/0,0,LOG,00000,"restored log file ""0000000A0000013F0000007E"" from archive",,,,,,,,,""
2017-04-27 13:54:02.771 UTC,,,2114,,5901eeb6.842,215,,2017-04-27 13:14:30 UTC,,0,LOG,00000,"restartpoint complete: wrote 93215 buffers (19.1%); 1 transaction log file(s) added, 36 removed, 0 recycled; write=23.522 s, sync=0.060 s, total=24.188 s; sync files=35, longest=0.053 s, average=0.001 s; distance=605163 kB, estimate=605163 kB",,,,,,,,,""
2017-04-27 13:54:02.771 UTC,,,2114,,5901eeb6.842,216,,2017-04-27 13:14:30 UTC,,0,LOG,00000,"recovery restart point at 13F/44A688B8","last completed transaction was at log time 2017-04-27 05:45:19.956478+00",,,,,,,,""
[...]

Processes

root@patroni-set-0008-0:/home/postgres# ps auxf|cat
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       248  0.0  0.0  21820  4292 ?        Ss   12:47   0:00 bash
root     29402  0.0  0.0  37768  3280 ?        R+   13:42   0:00  \_ ps auxf
root     29403  0.0  0.0   7880   736 ?        S+   13:42   0:00  \_ cat
root         1  0.0  0.0  58124 15752 ?        Ss   12:41   0:03 /usr/bin/python /usr/bin/supervisord --configuration=/etc/supervisor/supervisord.conf --nodaemon
root        40  0.0  0.0  26068  2288 ?        S    12:41   0:00 /usr/sbin/cron -f
postgres    41  0.1  0.1 578480 32076 ?        Sl   12:41   0:04 /usr/bin/python3 /usr/local/bin/patroni /home/postgres/postgres.yml
postgres    42  0.0  0.0  18044  2800 ?        S    12:41   0:00 /bin/bash /patroni_wait.sh --role master -- /usr/bin/pgqd /home/postgres/pgq_ticker.ini
postgres 29211  0.0  0.0   4380   672 ?        S    13:42   0:00  \_ sleep 60
postgres  2028  0.0  0.4 4222144 141468 ?      S    13:14   0:00 postgres -D /home/postgres/pgdata/pgroot/data --wal_log_hints=on --max_locks_per_transaction=64 --hot_standby=on --port=5432 --listen_addresses=0.0.0.0 --max_prepared_transactions=0 --wal_level=replica --max_replication_slots=5 --wal_keep_segments=8 --max_connections=1000 --max_wal_senders=5 --cluster_name=patroni-1 --track_commit_timestamp=off --max_worker_processes=8
postgres  2029  0.0  0.0 158928  3352 ?        Ss   13:14   0:00  \_ postgres: patroni-1: logger process   
postgres  2030  6.0 10.6 4222212 3279180 ?     Rs   13:14   1:44  \_ postgres: patroni-1: startup process   recovering 0000000A0000013D00000045
postgres  2114  5.9 10.5 4222248 3260164 ?     Ds   13:14   1:42  \_ postgres: patroni-1: checkpointer process   
postgres  2115  0.0  0.1 4222144 32992 ?       Ss   13:14   0:00  \_ postgres: patroni-1: writer process   
postgres  2141  0.0  0.0 161048  3316 ?        Ss   13:14   0:00  \_ postgres: patroni-1: stats collector process   
postgres  2153  0.0  0.0 4224032 21504 ?       Ss   13:14   0:00  \_ postgres: patroni-1: postgres postgres 127.0.0.1(54200) idle
postgres 29287 48.4  0.0  74916 23200 ?        S    13:42   0:03 /usr/bin/python3 /usr/local/bin/wal-e --aws-instance-profile wal-prefetch /home/postgres/pgdata/pgroot/data/pg_xlog 0000000A0000013D00000047
postgres 29405  0.0  0.0  74916 15108 ?        R    13:42   0:00  \_ /usr/bin/python3 /usr/local/bin/wal-e --aws-instance-profile wal-prefetch /home/postgres/pgdata/pgroot/data/pg_xlog 0000000A0000013D00000047
postgres 29288 51.0  0.1 183860 35728 ?        Sl   13:42   0:03 /usr/bin/python3 /usr/local/bin/wal-e --aws-instance-profile wal-prefetch /home/postgres/pgdata/pgroot/data/pg_xlog 0000000A0000013D00000046
postgres 29399  0.0  0.0   6764  1564 ?        S    13:42   0:00  \_ lzop -d -c -
postgres 29335 47.1  0.0  55644 16312 ?        R    13:42   0:02 /usr/bin/python3 /usr/local/bin/wal-e --aws-instance-profile wal-prefetch /home/postgres/pgdata/pgroot/data/pg_xlog 0000000A0000013D00000049
postgres 29336 49.8  0.0  77004 23256 ?        R    13:42   0:02 /usr/bin/python3 /usr/local/bin/wal-e --aws-instance-profile wal-prefetch /home/postgres/pgdata/pgroot/data/pg_xlog 0000000A0000013D0000004A
postgres 29337 50.3  0.0  57596 18008 ?        R    13:42   0:03 /usr/bin/python3 /usr/local/bin/wal-e --aws-instance-profile wal-prefetch /home/postgres/pgdata/pgroot/data/pg_xlog 0000000A0000013D00000048
postgres 29369 54.6  0.0 107628 26772 ?        R    13:42   0:01 /usr/bin/python3 /usr/local/bin/wal-e --aws-instance-profile wal-fetch 0000000A0000013D00000044 pg_xlog/RECOVERYXLOG
postgres 29371 55.0  0.0 107628 26772 ?        R    13:42   0:01 /usr/bin/python3 /usr/local/bin/wal-e --aws-instance-profile wal-fetch 0000000A0000013D00000044 pg_xlog/RECOVERYXLOG
postgres 29393 46.0  0.0 107624 26672 ?        R    13:42   0:00 /usr/bin/python3 /usr/local/bin/wal-e --aws-instance-profile wal-fetch 0000000A0000013D00000045 pg_xlog/RECOVERYXLOG

User specified and dynamic configuration

The current Patroni configuration is static: it takes some specific values when explicitly set (e.g. the DCS) and others from the environment (IP, name).

The proposal is:

  1. Have some defaults
  2. Override the defaults with environment specific values, examples:
    • name
    • IP
    • shared_buffers
  3. Override these values with the user-supplied values, optionally with a filter.
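
The layering described above could be sketched as a recursive dictionary merge (`deep_merge` is a hypothetical helper, not the actual Spilo code):

```python
import copy

def deep_merge(base, override):
    """Recursively merge `override` into `base`, returning a new dict."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result

# 1. defaults, 2. environment-specific values, 3. user-supplied values
defaults = {'postgresql': {'parameters': {'shared_buffers': '128MB'}}}
env_overrides = {'postgresql': {'name': 'patroni-set-0004-0',
                                'parameters': {'shared_buffers': '1GB'}}}
user_config = {'postgresql': {'parameters': {'max_connections': 500}}}

config = deep_merge(deep_merge(defaults, env_overrides), user_config)
# config['postgresql']['parameters'] now holds both the environment's
# shared_buffers and the user's max_connections
```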

Unable to use exhibitor

When specifying exhibitor as a host, patroni dies:

postgres@786112c34b6e:~$ patroni postgres.yml
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/patroni/__init__.py", line 71, in main
    patroni = Patroni(config)
  File "/usr/local/lib/python2.7/dist-packages/patroni/__init__.py", line 23, in __init__
    self.dcs = self.get_dcs(self.postgresql.name, config)
  File "/usr/local/lib/python2.7/dist-packages/patroni/__init__.py", line 37, in get_dcs
    return ZooKeeper(name, config['zookeeper'])
  File "/usr/local/lib/python2.7/dist-packages/patroni/zookeeper.py", line 83, in __init__
    self.exhibitor = ExhibitorEnsembleProvider(exhibitor['hosts'], exhibitor['port'], poll_interval=interval)
  File "/usr/local/lib/python2.7/dist-packages/patroni/zookeeper.py", line 32, in __init__
    while not self.poll():
  File "/usr/local/lib/python2.7/dist-packages/patroni/zookeeper.py", line 40, in poll
    json = self._query_exhibitors(self._exhibitors)
  File "/usr/local/lib/python2.7/dist-packages/patroni/zookeeper.py", line 55, in _query_exhibitors
    random.shuffle(exhibitors)
  File "/usr/lib/python2.7/random.py", line 289, in shuffle
    x[i], x[j] = x[j], x[i]
TypeError: 'str' object does not support item assignment

I assume that this is because exhibitor_hosts should be a list.
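
The crash is easy to reproduce in isolation: `random.shuffle` mutates its argument in place, which fails on an immutable string (the host values here are illustrative):

```python
import random

# Passing the exhibitor hosts as a plain string reproduces the error:
hosts = "zk1,zk2,zk3"
try:
    random.shuffle(hosts)          # strings are immutable
except TypeError as err:
    print(err)                     # 'str' object does not support item assignment

# Splitting the value into a list, as Patroni expects, works fine:
host_list = hosts.split(",")
random.shuffle(host_list)
```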

Tune OS parameters, kernel parameters

Currently we don't tune any parameters of the underlying OS, which can result in errors.

Today we received the following errors on a tiny database on a t2.medium:

LOG:  could not fork autovacuum worker process: Cannot allocate memory

We are waiting for more information to troubleshoot further and find the cause of this one.

WAL-E should run against K8s

Right now, the WAL segments are not archived and base backups are not created. We need to find an equivalent cloud storage on GCE/K8s and make WAL-E talk to it (the easiest way would be to find something that already speaks the S3 protocol).

S3 backups with helm chart

I am running Patroni in Kubernetes on AWS and have been trying to configure backups to S3. Is there a clear way to add S3 integration to the Helm chart? I have attempted forking it and adding this type of fix myself; I just want to make sure this is the right way to move forward.

Thanks,
Jordan

Make default DCS path more unique

The default DCS path /service/{scope} might look a bit confusing if the scope doesn't include spilo anywhere in it.

For example, when testing I have the habit of creating stacks while putting just an increasing number for the version, e.g. senza create myspilo.yaml 7. This will end up as /service/7 in the DCS, which might not be unique enough, especially given there is usually no authn/authz in the DCS: everyone can write anywhere.

How about using /service/spilo/{scope} by default instead? Can this be configured at the moment?

Let choose default passwords via TaupageConfig

Right now they are hard-coded, and changing the replica password is not straightforward. It would be nice to set them all up in the senza.yaml file, perhaps inside TaupageConfig, so it happens at Spilo cluster creation time.

Note that if you change the password in the config file, it does not take effect immediately; it will make it into recovery.conf at some later point (a promote/demote status change of the replica). You will also need to change the replication role's password on the server, as that role is created only once, during the initial setup of the cluster.

Etcd discovery domain doesn't work with plain Kubernetes

Attempting to set the ETCD_DISCOVERY_DOMAIN to an etcd service on Kubernetes without helm results in etcd not working.

2016-11-06 07:43:05,803 INFO: waiting on etcd
2016-11-06 07:43:10,817 ERROR: Can not resolve SRV for etcd.default.svc.cluster.local
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/patroni/dcs/etcd.py", line 150, in get_srv_record
    return [(str(r.target).rstrip('.'), r.port) for r in resolver.query('_etcd-server._tcp.' + host, 'SRV')]
  File "/usr/local/lib/python3.4/dist-packages/dns/resolver.py", line 1068, in query
    raise_on_no_answer, source_port)
  File "/usr/local/lib/python3.4/dist-packages/dns/resolver.py", line 995, in query
    raise NXDOMAIN(qname=qnames_to_try)
dns.resolver.NXDOMAIN: None of DNS query names exist: ['_etcd-server._tcp.etcd.default.svc.cluster.local.', '_etcd-server._tcp.etcd.default.svc.cluster.local.default.svc.cluster.local.', '_etcd-server._tcp.etcd.default.svc.cluster.local.svc.cluster.local.', '_etcd-server._tcp.etcd.default.svc.cluster.local.cluster.local.', '_etcd-server._tcp.etcd.default.svc.cluster.local.us-west-2.compute.internal.']

I also tried setting ETCD_HOST instead, but that resulted in patroni not starting at all:

2016-11-06 08:00:45,833 INFO spawned: 'pgq' with pid 33
Usage: /usr/local/bin/patroni config.yml
Patroni may also read the configuration from the PATRONI_CONFIGURATION environment variable
2016-11-06 08:00:46,019 INFO exited: patroni (exit status 1; not expected)
2016-11-06 08:00:47,020 INFO success: cron entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-11-06 08:00:47,021 INFO spawned: 'patroni' with pid 39
2016-11-06 08:00:47,021 INFO success: pgq entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Usage: /usr/local/bin/patroni config.yml
Patroni may also read the configuration from the PATRONI_CONFIGURATION environment variable

I suspect the latter is due to quoting issues with the : in the etcd host name, but because the value is passed through verbatim, it's not really fixable in the Kube file.

Allow setting crontabs

Spilo supports running cron jobs inside the container, as it uses supervisord to run multiple processes. At the moment, however, there is no way to define those cron jobs when starting the container. We should add an environment variable and use it to write a crontab.
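
One possible shape for this (a sketch only; the variable name CRONTAB and the JSON-list format are assumptions, not an existing Spilo feature):

```python
import json
import os

def build_crontab(env=os.environ):
    """Build a crontab from a hypothetical CRONTAB environment variable
    holding a JSON list of crontab lines. Returns None if unset or empty."""
    entries = json.loads(env.get("CRONTAB", "[]"))
    if not entries:
        return None
    # Installing it would then be a matter of piping this to `crontab -`
    # for the postgres user at container startup.
    return "\n".join(entries) + "\n"
```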

On plain kubernetes, host IP is passed as replication address

In a Kube install, for some reason the host IP is passed as the advertised API and replication address for each node. This, of course, doesn't work:

2016-11-06 08:45:23,702 INFO: trying to bootstrap from leader 'patroni_0'
pg_basebackup: could not connect to server: could not connect to server: Connection refused
        Is the server running on host "172.31.42.94" and accepting
        TCP/IP connections on port 5432?

I'm frankly not sure how the container even has access to the host IP, let alone why it uses it. Will find out.

No logging of postgresql after Patroni was restarted

Situation:

  • Patroni died
  • PostgreSQL kept running
  • Patroni was respawned; no action was taken, as the leader was still the same

After this, no PostgreSQL logging was visible:

Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]: 2016-07-27 14:51:58,099 INFO: Starting new HTTP connection (1): 172.31.141.11
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]: 2016-07-27 14:51:58,610 INFO: closed patroni connection to the postgresql cluster
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]: Traceback (most recent call last):
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]:   File "/usr/local/bin/patroni", line 9, in <module>
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]:     load_entry_point('patroni==0.90', 'console_scripts', 'patroni')()
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]:   File "/usr/local/lib/python2.7/dist-packages/patroni/__init__.py", line 105, in main
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]:     patroni.postgresql.stop(checkpoint=False)
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]:   File "/usr/local/lib/python2.7/dist-packages/patroni/postgresql.py", line 398, in stop
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]:     if not self.is_running():
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]:   File "/usr/local/lib/python2.7/dist-packages/patroni/postgresql.py", line 317, in is_running
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]:     return subprocess.call(' '.join(self._pg_ctl) + ' status > /dev/null 2>&1', shell=True) == 0
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]:   File "/usr/lib/python2.7/subprocess.py", line 522, in call
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]:     return Popen(*popenargs, **kwargs).wait()
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]:   File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]:     errread, errwrite)
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]:   File "/usr/lib/python2.7/subprocess.py", line 1223, in _execute_child
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]:     self.pid = os.fork()
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]: OSError: [Errno 12] Cannot allocate memory
Jul 27 14:51:58 ip-172-31-157-143 docker/f7d1ae09a079[884]: 2016-07-27 14:51:58,765 INFO exited: patroni (exit status 1; not expected)
Jul 27 14:51:59 ip-172-31-157-143 docker/f7d1ae09a079[884]: 2016-07-27 14:51:59,771 INFO spawned: 'patroni' with pid 30418

Etcd appliance: multiple stack creations possible with same STACK_VERSION

Hello,
I've noticed that it's possible to create multiple stacks with the same stack version (e.g. "etcd1") using etcd-cluster.yaml. This is strange since usually senza does not allow this.

The main point here is that after creating and deleting a cluster with the same version "etcd1", trying to list instances or events will result in an empty output.
Basically, running senza instances etcd-cluster.yaml etcd1 or senza events etcd-cluster.yaml etcd1 will result in nothing being shown. As if there is no cluster online with stack version "etcd1".

AWS support uses unsupported DNS name

While troubleshooting a chart that depends on Patroni, we discovered that Spilo tries to get AWS instance data via a request to http://instance-data/... in one place only (other places use the IP address 169.254.169.254). This hostname isn't documented or supported by AWS:
https://forums.aws.amazon.com/message.jspa?messageID=536813
and in some cases it doesn't work (including ours). We were able to work around it by cramming the name into our kube DNS server, but it would be more consistent to make the request by IP address.
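
The fix amounts to building the metadata URL from the fixed IP (helper names here are hypothetical; only the 169.254.169.254 endpoint and /latest/meta-data/ path are standard EC2 behavior):

```python
import urllib.request

METADATA_IP = "169.254.169.254"  # the documented EC2 metadata endpoint

def metadata_url(path):
    # Use the fixed link-local IP instead of the undocumented
    # "instance-data" hostname, which fails in some DNS setups.
    return "http://{}/latest/meta-data/{}".format(METADATA_IP, path)

def instance_metadata(path, timeout=2):
    # Only reachable from inside an EC2 instance, of course.
    with urllib.request.urlopen(metadata_url(path), timeout=timeout) as resp:
        return resp.read().decode()
```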

One other note: while reading the current issue list, I noticed #186 and read through the comments. I'm not familiar with how OpenStack is supported but it sounded like they might be relying on the DNS name lookup failure? I'm not sure how fixing this might affect that behavior.

Decouple Security Groups of ELB and ec2 instances

Currently the ELB and the EC2 instances share a Security Group.
This rules out some configuration options that seem like valid use cases.

The proposal is a SG for the ELB and a separate SG for the EC2 instances (default access: from the ELB only).

Spilo reuses PATRONI_CONFIGURATION in order to supply custom configuration

I'm thinking about making some changes that may potentially break things for already existing clusters. The PATRONI_CONFIGURATION variable, which is set by the senza base template and used by Spilo to merge the default configuration with the custom one, happens to also be used by Patroni itself to read its entire configuration from environment variables. So far this doesn't break things for Patroni, because the environment is cleaned up at the point where Spilo executes Patroni. It does, however, lead to issues when someone runs patronictl inside the Docker container as the default (root) user: patronictl sees the partial PATRONI_CONFIGURATION set by Spilo and interprets it as a complete Patroni configuration, which results in an exception, since some required parameters (namely, the authentication-related ones) are not set.

Therefore, in order to avoid breaking existing setups I will introduce the SPILO_CONFIGURATION variable, that will be examined alongside PATRONI_CONFIGURATION, and PATRONI_CONFIGURATION will be ignored if SPILO_CONFIGURATION is set.
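
The precedence rule described above could look like this (a sketch, not the actual Spilo code):

```python
import os

def raw_configuration(env=os.environ):
    """SPILO_CONFIGURATION wins; PATRONI_CONFIGURATION is only a fallback,
    kept for compatibility with existing clusters."""
    return env.get('SPILO_CONFIGURATION') or env.get('PATRONI_CONFIGURATION', '')
```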

Tag EBS volume with Name early

Currently, tagging of a new EBS volume with Name seems to happen only after the base backup has finished streaming. This makes it harder to monitor the performance of the EBS volume using the AWS web console or any other tool where you would filter volumes by the Name tag.

Allow S3-only replication

This is more a discussion, than a new feature. The goal is to run the system with replicas on AWS and master in a legacy datacenter using S3 as the only medium to ship changes.

Right now, we can achieve this by creating the master member and leader keys manually without the TTL, pointing them somewhere. The replicas won't be able to connect to the master, but they will be able to initialize themselves from the base backup on S3, as well as to fetch WAL segments from there.

The problem is to find out how to ship WALs from the legacy datacenter to S3. We probably don't want to do it from the master, because the connection disruption from the master to AWS would risk the master running out of WAL disk space. We might use pg_receivexlog to collect those segments elsewhere and another process, using the inotify subsystem, to call WAL-E once the new segment is written by pg_receivexlog. As a side note, it would be great if pg_receivexlog could call an external command for each WAL it writes.

Spilo versions

Spilo should have tagged versions, like Patroni.

It would be convenient if the tagged versions of Spilo matched the versions of Patroni.

If Spilo needs additional releases for a given Patroni version, e.g. to fix issues within Spilo without waiting for a new Patroni release, it could add a +spilo.$BUMP_INCR suffix to the Patroni version, similar to how Ubuntu APT packages add a -ubuntu42 suffix.

Rebuilding Dockerfile without caches breaks build

This post suggests installing setuptools will fix this:
http://stackoverflow.com/questions/35780537/error-no-module-named-markerlib-when-installing-some-packages-on-virtualenv

root@fcb9e8166231:/home/postgres# patroni
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 5, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2749, in <module>
    working_set = WorkingSet._build_master()
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 444, in _build_master
    ws.require(__requires__)
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 725, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 633, in resolve
    requirements.extend(dist.requires(req.extras)[::-1])
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2291, in requires
    dm = self._dep_map
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2484, in _dep_map
    self.__dep_map = self._compute_dependencies()
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2501, in _compute_dependencies
    from _markerlib import compile as compile_marker
ImportError: No module named _markerlib
