
stups-etcd-cluster's Introduction


Introduction

This etcd appliance is designed for an AWS environment. It is available as an etcd cluster internally, for any application willing to use it. For discovery of the appliance, it maintains up-to-date DNS SRV and A records in a Route53 zone.

Design

The appliance is meant to run on EC2 instances that are members of one autoscaling group. Using an autoscaling group makes it possible to discover all cluster members via the AWS API (python-boto). The etcd process is executed by a Python wrapper that takes care of discovering all members of an already existing cluster, or of bootstrapping a new one.
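
A minimal sketch of this discovery step, assuming boto3 rather than the legacy python-boto library the appliance uses; the stack-name tag filter mirrors the ec2:describe_instances call visible in the appliance's own logs further below:

    import boto3

    def discover_members(stack_name, region):
        """Find all running EC2 instances belonging to the cluster's
        CloudFormation stack (and thus to its autoscaling group)."""
        ec2 = boto3.client('ec2', region_name=region)
        pages = ec2.get_paginator('describe_instances').paginate(Filters=[
            {'Name': 'tag:aws:cloudformation:stack-name', 'Values': [stack_name]},
            {'Name': 'instance-state-name', 'Values': ['running']},
        ])
        return [{'id': i['InstanceId'], 'ip': i['PrivateIpAddress']}
                for page in pages
                for r in page['Reservations']
                for i in r['Instances']]

    print(discover_members('etcd-cluster-releaseetcd', 'eu-central-1'))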

Currently the following scenarios are supported:

  • Starting up a new cluster. etcd.py will figure out that this is a new cluster and run the etcd daemon with the necessary options.
  • If a new EC2 instance is spawned within an existing autoscaling group, etcd.py will take care of adding this instance to the already existing cluster and apply the needed options to the etcd daemon.
  • If something happens to etcd (it crashes or exits), etcd.py will try to restart it.
  • Periodically the leader performs a cluster health check and removes cluster members which are not part of the autoscaling group.
  • It also creates or updates SRV and A records in a given zone via the AWS API (a sketch of this follows the list).
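
A hedged sketch of that DNS maintenance with boto3; the record-name patterns match the etcd.py excerpts quoted in the issues below, while the TTL and the single-target SRV layout are illustrative assumptions:

    import boto3

    def update_dns(zone_id, stack_version, hosted_zone, peer_ips):
        """UPSERT the discovery records: a multi-value A record for the
        members, plus SRV records for the peer and client ports."""
        a_name = 'etcd-server.{0}.{1}'.format(stack_version, hosted_zone)
        changes = [
            {'Action': 'UPSERT', 'ResourceRecordSet': {
                'Name': a_name, 'Type': 'A', 'TTL': 60,
                'ResourceRecords': [{'Value': ip} for ip in peer_ips]}},
            {'Action': 'UPSERT', 'ResourceRecordSet': {
                'Name': '_etcd-server._tcp.{0}.{1}'.format(stack_version, hosted_zone),
                'Type': 'SRV', 'TTL': 60,
                'ResourceRecords': [{'Value': '1 1 2380 ' + a_name}]}},
            {'Action': 'UPSERT', 'ResourceRecordSet': {
                'Name': '_etcd._tcp.{0}.{1}'.format(stack_version, hosted_zone),
                'Type': 'SRV', 'TTL': 60,
                'ResourceRecords': [{'Value': '1 1 2379 ' + a_name}]}},
        ]
        boto3.client('route53').change_resource_record_sets(
            HostedZoneId=zone_id, ChangeBatch={'Changes': changes})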

Usage

Step 1: Create an etcd cluster

A cluster can be created by issuing a command such as:

senza create etcd-cluster.yaml STACK_VERSION HOSTED_ZONE DOCKER_IMAGE

For example, if you are making an etcd cluster to be used by a service called release, you could issue the following:

senza create https://raw.githubusercontent.com/zalando-stups/stups-etcd-cluster/master/etcd-cluster.yaml releaseetcd \
                               HostedZone=elephant.example.org \
                               DockerImage=registry.opensource.zalan.do/acid/etcd-cluster:3.0.17-p17

Step 2: Confirm successful cluster creation

Running this senza create command should have created:

  • the required number of EC2 instances
    • with stack name etcd-cluster
    • with instance name etcd-cluster-releaseetcd
  • a security group allowing etcd's ports 2379 and 2380
  • an IAM role that allows listing and describing EC2 resources and creating records in Route53
  • DNS records
    • an A record of the form releaseetcd.elephant.example.org.
    • a SRV record of the form _etcd-server._tcp.releaseetcd.elephant.example.org. with port = 2380, i.e. the peer port
    • a SRV record of the form _etcd._tcp.releaseetcd.elephant.example.org. with port = 2379, i.e. the client port
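
With these records in place a client can bootstrap its endpoint list from DNS alone; a minimal sketch, assuming the dnspython 2.x package:

    import dns.resolver  # assumes the dnspython 2.x package

    def client_endpoints(cluster_domain):
        """Resolve the client-port SRV record into a list of http endpoints."""
        answers = dns.resolver.resolve('_etcd._tcp.' + cluster_domain, 'SRV')
        return ['http://{0}:{1}'.format(str(rr.target).rstrip('.'), rr.port)
                for rr in answers]

    print(client_endpoints('releaseetcd.elephant.example.org'))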

Multiregion cluster

It is possible to deploy the etcd cluster across multiple regions. To do that, deploy the CloudFormation stack into multiple regions with the same stack name. This enables discovery of instances from other regions and grants access to those instances via security groups. Deployment has to be done region by region; otherwise there is a chance of a race condition during cluster bootstrap.

senza --region eu-central-1 create etcd-cluster-multiregion.yaml multietcd \
        HostedZone=elephant.example.org \
        DockerImage=registry.opensource.zalan.do/acid/etcd-cluster:3.0.17-p17 \
        ActiveRegions=eu-west-1,eu-central-1 \
        InstanceCount=4

senza --region eu-central-1 wait etcd-cluster multietcd

senza --region eu-west-1 create etcd-cluster-multiregion.yaml multietcd \
        HostedZone=elephant.example.org \
        DockerImage=registry.opensource.zalan.do/acid/etcd-cluster:3.0.17-p17 \
        ActiveRegions=eu-west-1,eu-central-1 \
        InstanceCount=1

Upgrade

In order to perform a minor or major upgrade without downtime, you need to terminate all EC2 instances one by one. Between terminations you need to wait at least 5 minutes and monitor cluster health, logs and DNS records. Only terminate the next instance once the cluster is healthy again (a health-check sketch follows below).
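
A minimal sketch of such a health check, polling etcd's /health endpoint (which reports {"health": "true"} on a healthy member) until it is safe to terminate the next instance:

    import time
    import requests

    def wait_until_healthy(endpoint, retries=30, delay=10):
        """Poll the etcd /health endpoint until the member reports healthy."""
        for _ in range(retries):
            try:
                r = requests.get(endpoint + '/health', timeout=3)
                if r.ok and r.json().get('health') in (True, 'true'):
                    return True
            except requests.RequestException:
                pass  # the member may still be restarting
            time.sleep(delay)
        return False

    # terminate the next instance only after this returns True
    assert wait_until_healthy('http://releaseetcd.elephant.example.org:2379')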

To upgrade an existing etcd deployment to 3.0, you must be running 2.3. If you are running a version of etcd before 2.3, you must upgrade to 2.3 (preferably 2.3.7) before upgrading to 3.0.

A major upgrade is possible one version at a time, i.e. it is possible to upgrade from 2.0 to 2.1 and from 2.1 to 2.2, but it is not possible to upgrade from 2.0 to 2.2.

Before 3.0 it was possible to simply "join" a new member with a higher major version and an empty data directory to the cluster, and it worked fine. For reasons not fully understood, this approach stopped working for the 2.3 -> 3.0 upgrade. So now we use a different technique: if the cluster_version is still 2.3, we "join" an etcd 2.3.7 member to the cluster in order to download the latest data. When the cluster becomes healthy again, we take an "upgrade_lock", stop etcd 2.3.7 and start etcd 3.0. When the cluster is healthy again, we remove the "upgrade_lock" so that the other members can upgrade.

The upgrade lock is needed to:

  • Temporarily switch off the "house-keeping" job, whose task is removing "unhealthy" members and updating DNS records.
  • Make sure that we upgrade one cluster member at a time (a sketch of acquiring such a lock follows).
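
The lock itself can be an ordinary etcd key created with a compare-and-swap; a hedged sketch against the v2 keys API, where the key name and TTL are illustrative rather than the appliance's actual values:

    import requests

    def acquire_upgrade_lock(endpoint, ttl=300):
        """Atomically create a lock key via the v2 keys API; with
        prevExist=false etcd answers 412 if the key already exists,
        so only one member can hold the lock at a time."""
        r = requests.put(endpoint + '/v2/keys/upgrade_lock',
                         data={'value': 'held', 'ttl': ttl, 'prevExist': 'false'})
        return r.status_code == 201  # 201 Created means we got the lock

    def release_upgrade_lock(endpoint):
        requests.delete(endpoint + '/v2/keys/upgrade_lock')

    # e.g. acquire_upgrade_lock('http://127.0.0.1:2379')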

Migration of an existing cluster to multiregion setup

Currently there are only two availability zones in the eu-central-1 region, so if one AZ goes down there is a 50% chance that our etcd cluster becomes read-only. To avoid that, we want to run one additional instance in the eu-west-1 region.

Step 1: Migrate to the multiregion setup, but with only 1 (ONE) active region, eu-central-1. To do that, run:

senza --region=eu-central-1 update etcd-cluster-multiregion.yaml existingcluster \
        HostedZone=elephant.example.org \
        DockerImage=registry.opensource.zalan.do/acid/etcd-cluster:3.0.17-p17 \
        ActiveRegions=eu-central-1 \
        InstanceCount=5

Then rotate the instances one by one, as during the Upgrade procedure.

Step 2: Enable the second region.

senza --region=eu-central-1 update etcd-cluster-multiregion.yaml existingcluster \
        HostedZone=elephant.example.org \
        DockerImage=registry.opensource.zalan.do/acid/etcd-cluster:3.0.17-p17 \
        ActiveRegions=eu-central-1,eu-west-1 \
        InstanceCount=5

And rotate all instances once again. Although the second region is not populated yet, the cluster will already consider itself to be running in multiregion mode.

Step 3: Change instance count in eu-central-1 to 4:

senza --region=eu-central-1 update etcd-cluster-multiregion.yaml existingcluster \
        HostedZone=elephant.example.org \
        DockerImage=registry.opensource.zalan.do/acid/etcd-cluster:3.0.17-p17 \
        ActiveRegions=eu-central-1,eu-west-1 \
        InstanceCount=4

Autoscaling will kill one of the instances automatically.

Step 4: Deploy the CloudFormation stack in the other region:

senza --region eu-west-1 create etcd-cluster-multiregion.yaml existingcluster \
        HostedZone=elephant.example.org \
        DockerImage=registry.opensource.zalan.do/acid/etcd-cluster:3.0.17-p17 \
        ActiveRegions=eu-west-1,eu-central-1 \
        InstanceCount=1

Demo

Demo on asciicast

stups-etcd-cluster's People

Contributors

a1exsh, akauppi, alexeyklyukin, cyberdem0n, feikesteenbergen, hjacobs, linki, lutzh, regispl, szuecs, tuxlife, vosmann, whiskeysierra

stups-etcd-cluster's Issues

HouseKeeper: Type str doesn't support the buffer API

06:26:12.000 Jan 11 05:26:12 ip-172-31-166-204 docker/a9251b73b92c[785]: ERROR 2016-01-11 05:26:12,831 - Exception in HouseKeeper main loop
06:26:12.000 Jan 11 05:26:12 ip-172-31-166-204 docker/a9251b73b92c[785]: Traceback (most recent call last):
06:26:12.000 Jan 11 05:26:12 ip-172-31-166-204 docker/a9251b73b92c[785]: File "/bin/etcd.py", line 433, in run
06:26:12.000 Jan 11 05:26:12 ip-172-31-166-204 docker/a9251b73b92c[785]: if (update_required or self.members_changed() or self.cluster_unhealthy()) and self.acquire_lock():
06:26:12.000 Jan 11 05:26:12 ip-172-31-166-204 docker/a9251b73b92c[785]: File "/bin/etcd.py", line 386, in cluster_unhealthy
06:26:12.000 Jan 11 05:26:12 ip-172-31-166-204 docker/a9251b73b92c[785]: ret = any([True for line in process.stdout if 'is unhealthy' in line])
06:26:12.000 Jan 11 05:26:12 ip-172-31-166-204 docker/a9251b73b92c[785]: File "/bin/etcd.py", line 386, in
06:26:12.000 Jan 11 05:26:12 ip-172-31-166-204 docker/a9251b73b92c[785]: ret = any([True for line in process.stdout if 'is unhealthy' in line])
06:26:12.000 Jan 11 05:26:12 ip-172-31-166-204 docker/a9251b73b92c[785]: TypeError: Type str doesn't support the buffer API
06:26:12.000 Jan 11 05:26:12 ip-172-31-166-204 docker/a9251b73b92c[785]: DEBUG 2016-01-11 05:26:12,833 - Sleeping 30 seconds...
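
The traceback points at a Python 3 bytes-vs-str mismatch: process.stdout yields bytes, which cannot be tested against the str 'is unhealthy'. A minimal fix sketch:

    # before: iterating raw bytes from the subprocess pipe
    # ret = any([True for line in process.stdout if 'is unhealthy' in line])

    # after: decode each line before the substring test
    ret = any('is unhealthy' in line.decode('utf-8') for line in process.stdout)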

Question regarding upgrades

Hi,

I find the upgrade instructions a bit vague. It says:

In order to perform a minor or major upgrade without downtime you need to terminate all EC2 instances one-by-one.

So the suggestion is to do a "rolling update", if I understand correctly. The senza "version" would remain the same, as it is used here as a service identifier / DNS entry that should remain stable.
Could you elaborate on how such a rolling update would be performed? The quote above is confusing, as terminating EC2 instances by hand would just result in new ones being created, because they belong to the same autoscaling group, which would attempt to maintain the number of nodes.
So would you have to manually add an EC2 instance with a newer etcd version first, and then remove one with the old version?

Could you maybe elaborate a bit on, or even provide an example for, the update of the etcd cluster?

Support for CoreOS Container Linux instead of Taupage AMI

We are running the etcd appliance for our Kubernetes setup and would like to switch to CoreOS Container Linux as it would remove one dependency for us (Kubernetes nodes already run Container Linux).

It should be relatively simple to provide another Cloud Formation / Senza file with CoreOS Container Linux instead of Taupage (in the end we are just running a Docker container).

Member fails to recover

Our cluster has an unhealthy node which keeps printing the following in the logs. Let me know if you need more information.

Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:58:29,639 - Calling paginated ec2:describe_instances with {'Filters': [{'Values': ['etcd-cluster-etcd'], 'Name': 'tag:aws:cloudformation:stack-name'}]}
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:58:29,650 - Starting new HTTPS connection (1): ec2.eu-central-1.amazonaws.com
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:58:29,738 - Starting new HTTP connection (1): 172.31.131.127
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:58:29,754 - Starting new HTTP connection (1): 172.31.131.127
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:58:29,757 - Starting new HTTP connection (1): 172.31.131.127
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:58:29,759 - My clientURLs list is not empty: ['http://172.31.131.225:2379']
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:58:29,759 - My data directory exists=True
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:58:29,761 - Started new /bin/etcd process with pid: 444 and args: ['-name', 'i-05982c02c444ea551', '--data-dir', 'data', '-listen-peer-urls', 'http://0.0.0.0:2380', '-initial-advertise-peer-urls', 'http://172.31.131.225:2380', '-listen-client-urls', 'http://0.0.0.0:2379', '-advertise-client-urls', 'http://172.31.131.225:2379', '-initial-cluster', 'i-05982c02c444ea551=http://172.31.131.225:2380,i-06cda1ba0307bcd19=http://172.31.140.202:2380,i-06ed4eb078b00f623=http://172.31.153.93:2380,i-077dfb38fc0039342=http://172.31.159.46:2380,i-0a665d73f1d8c7cfa=http://172.31.131.127:2380', '-initial-cluster-token', 'etcd-cluster-etcd', '-initial-cluster-state', 'existing']
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:29.791434 W | flags: unrecognized environment variable ETCDVERSION=3.0.15
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:29.791768 I | etcdmain: etcd Version: 3.0.15
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:29.791939 I | etcdmain: Git SHA: fc00305
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:29.792099 I | etcdmain: Go Version: go1.6.3
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:29.792262 I | etcdmain: Go OS/Arch: linux/amd64
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:29.792433 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:29.792618 N | etcdmain: the server is already initialized as member before, starting as etcd member...
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:29.792891 I | etcdmain: listening for peers on http://0.0.0.0:2380
Jun 28 11:58:29 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:29.793147 I | etcdmain: listening for client requests on 0.0.0.0:2379
Jun 28 11:58:35 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:58:35,505 - Starting new HTTP connection (1): 172.31.131.225
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:38.301983 I | etcdserver: recovered store from snapshot at index 222437884
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:38.302392 I | etcdserver: name = i-05982c02c444ea551
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:38.302576 I | etcdserver: data dir = data
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:38.302734 I | etcdserver: member dir = data/member
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:38.302895 I | etcdserver: heartbeat = 100ms
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:38.303048 I | etcdserver: election = 1000ms
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:38.303180 I | etcdserver: snapshot count = 10000
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:38.303341 I | etcdserver: advertise client URLs = http://172.31.131.225:2379
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: ERROR  2017-06-28 11:58:38,616 - Exception in HouseKeeper main loop
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: Traceback (most recent call last):
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 377, in _make_request
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     httplib_response = conn.getresponse(buffering=True)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: TypeError: getresponse() got an unexpected keyword argument 'buffering'
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: During handling of the above exception, another exception occurred:
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: Traceback (most recent call last):
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 379, in _make_request
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     httplib_response = conn.getresponse()
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3.5/http/client.py", line 1197, in getresponse
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     response.begin()
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3.5/http/client.py", line 297, in begin
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     version, status, reason = self._read_status()
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3.5/http/client.py", line 258, in _read_status
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3.5/socket.py", line 575, in readinto
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     return self._sock.recv_into(b)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: socket.timeout: timed out
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: During handling of the above exception, another exception occurred:
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: Traceback (most recent call last):
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3/dist-packages/requests/adapters.py", line 376, in send
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     timeout=timeout
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 610, in urlopen
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     _stacktrace=sys.exc_info()[2])
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 247, in increment
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     raise six.reraise(type(error), error, _stacktrace)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3/dist-packages/six.py", line 686, in reraise
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     raise value
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 560, in urlopen
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     body=body, headers=headers)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 381, in _make_request
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 309, in _raise_timeout
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: requests.packages.urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='172.31.131.225', port=2379): Read timed out. (read timeout=3.1)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: During handling of the above exception, another exception occurred:
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: Traceback (most recent call last):
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/bin/etcd.py", line 586, in run
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     if self.manager.etcd_pid != 0 and self.is_leader():
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/bin/etcd.py", line 497, in is_leader
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     return self.manager.me.is_leader()
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/bin/etcd.py", line 194, in is_leader
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     return not self.api_get('stats/leader') is None
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/bin/etcd.py", line 163, in api_get
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     response = requests.get(url, timeout=self.API_TIMEOUT)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3/dist-packages/requests/api.py", line 67, in get
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     return request('get', url, params=params, **kwargs)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3/dist-packages/requests/api.py", line 53, in request
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     return session.request(method=method, url=url, **kwargs)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3/dist-packages/requests/sessions.py", line 468, in request
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     resp = self.send(prep, **send_kwargs)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3/dist-packages/requests/sessions.py", line 576, in send
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     r = adapter.send(request, **kwargs)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:   File "/usr/lib/python3/dist-packages/requests/adapters.py", line 449, in send
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:     raise ReadTimeout(e, request=request)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: requests.exceptions.ReadTimeout: HTTPConnectionPool(host='172.31.131.225', port=2379): Read timed out. (read timeout=3.1)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:38.899646 I | etcdserver: restarting member 53733ad8236d11ce in cluster 5cee1e413d02 at commit index 222455528
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:58:38.900135 C | raft: 53733ad8236d11ce state.commit 222455528 is out of range [222437884, 222438776]
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: panic: 53733ad8236d11ce state.commit 222455528 is out of range [222437884, 222438776]
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]:
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: goroutine 1 [running]:
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: panic(0xd450a0, 0xc835481250)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/usr/local/go/src/runtime/panic.go:481 +0x3e6
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc820190620, 0x12375a0, 0x2b, 0xc832070f40, 0x4, 0x4)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x191
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).loadState(0xc83639f450, 0x1b334, 0x88b67ac3e24bbdb2, 0xd4266e8, 0x0, 0x0, 0x0)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:942 +0x2a2
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.newRaft(0xc8201078d8, 0x0)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:225 +0x8ff
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.RestartNode(0xc8201078d8, 0x0, 0x0)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:213 +0x45
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.restartNode(0xc82006fb80, 0xc820066000, 0x29, 0xc820107d78, 0x0, 0x0, 0x0, 0x0)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/raft.go:369 +0x7c7
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc82006fb80, 0x0, 0x0, 0x0)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:348 +0x430e
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc8201a8000, 0x0, 0x0, 0x0)
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:374 +0x245f
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:116 +0x2101
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:36 +0x21e
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: main.main()
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/main.go:28 +0x14
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: WARNING 2017-06-28 11:58:38,933 - Process 444 finished with exit code 2
Jun 28 11:58:38 ip-172-31-131-225 docker/27d00dc9d174[820]: WARNING 2017-06-28 11:58:38,933 - Sleeping 30 seconds before next try...
Jun 28 11:59:08 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:59:08,944 - Calling paginated ec2:describe_instances with {'Filters': [{'Values': ['etcd-cluster-etcd'], 'Name': 'tag:aws:cloudformation:stack-name'}]}
Jun 28 11:59:08 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:59:08,959 - Starting new HTTPS connection (1): ec2.eu-central-1.amazonaws.com
Jun 28 11:59:09 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:59:09,067 - Starting new HTTP connection (1): 172.31.131.127
Jun 28 11:59:09 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:59:09,071 - Starting new HTTP connection (1): 172.31.131.127
Jun 28 11:59:09 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:59:09,074 - Starting new HTTP connection (1): 172.31.131.127
Jun 28 11:59:09 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:59:09,076 - My clientURLs list is not empty: ['http://172.31.131.225:2379']
Jun 28 11:59:09 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:59:09,076 - My data directory exists=True
Jun 28 11:59:09 ip-172-31-131-225 docker/27d00dc9d174[820]: INFO   2017-06-28 11:59:09,077 - Started new /bin/etcd process with pid: 450 and args: ['-name', 'i-05982c02c444ea551', '--data-dir', 'data', '-listen-peer-urls', 'http://0.0.0.0:2380', '-initial-advertise-peer-urls', 'http://172.31.131.225:2380', '-listen-client-urls', 'http://0.0.0.0:2379', '-advertise-client-urls', 'http://172.31.131.225:2379', '-initial-cluster', 'i-05982c02c444ea551=http://172.31.131.225:2380,i-06cda1ba0307bcd19=http://172.31.140.202:2380,i-06ed4eb078b00f623=http://172.31.153.93:2380,i-077dfb38fc0039342=http://172.31.159.46:2380,i-0a665d73f1d8c7cfa=http://172.31.131.127:2380', '-initial-cluster-token', 'etcd-cluster-etcd', '-initial-cluster-state', 'existing']
Jun 28 11:59:09 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:09.108064 W | flags: unrecognized environment variable ETCDVERSION=3.0.15
Jun 28 11:59:09 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:09.108416 I | etcdmain: etcd Version: 3.0.15
Jun 28 11:59:09 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:09.108593 I | etcdmain: Git SHA: fc00305
Jun 28 11:59:09 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:09.108771 I | etcdmain: Go Version: go1.6.3
Jun 28 11:59:09 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:09.108929 I | etcdmain: Go OS/Arch: linux/amd64
Jun 28 11:59:09 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:09.109088 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
Jun 28 11:59:09 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:09.109278 N | etcdmain: the server is already initialized as member before, starting as etcd member...
Jun 28 11:59:09 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:09.109530 I | etcdmain: listening for peers on http://0.0.0.0:2380
Jun 28 11:59:09 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:09.109792 I | etcdmain: listening for client requests on 0.0.0.0:2379
Jun 28 11:59:17 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:17.859842 I | etcdserver: recovered store from snapshot at index 222437884
Jun 28 11:59:17 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:17.860265 I | etcdserver: name = i-05982c02c444ea551
Jun 28 11:59:17 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:17.860450 I | etcdserver: data dir = data
Jun 28 11:59:17 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:17.860609 I | etcdserver: member dir = data/member
Jun 28 11:59:17 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:17.860753 I | etcdserver: heartbeat = 100ms
Jun 28 11:59:17 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:17.860895 I | etcdserver: election = 1000ms
Jun 28 11:59:17 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:17.861047 I | etcdserver: snapshot count = 10000
Jun 28 11:59:17 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:17.861212 I | etcdserver: advertise client URLs = http://172.31.131.225:2379
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:18.446568 I | etcdserver: restarting member 53733ad8236d11ce in cluster 5cee1e413d02 at commit index 222455528
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: 2017-06-28 11:59:18.447078 C | raft: 53733ad8236d11ce state.commit 222455528 is out of range [222437884, 222438776]
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: panic: 53733ad8236d11ce state.commit 222455528 is out of range [222437884, 222438776]
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]:
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: goroutine 1 [running]:
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: panic(0xd450a0, 0xc8355fb560)
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/usr/local/go/src/runtime/panic.go:481 +0x3e6
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc820190620, 0x12375a0, 0x2b, 0xc833422f40, 0x4, 0x4)
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x191
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).loadState(0xc836564270, 0x1b334, 0x88b67ac3e24bbdb2, 0xd4266e8, 0x0, 0x0, 0x0)
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:942 +0x2a2
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.newRaft(0xc8201078d8, 0x0)
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:225 +0x8ff
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.RestartNode(0xc8201078d8, 0x0, 0x0)
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:213 +0x45
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.restartNode(0xc82006fb80, 0xc820066000, 0x29, 0xc820107d78, 0x0, 0x0, 0x0, 0x0)
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/raft.go:369 +0x7c7
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc82006fb80, 0x0, 0x0, 0x0)
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:348 +0x430e
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc8201a8000, 0x0, 0x0, 0x0)
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:374 +0x245f
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:116 +0x2101
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:36 +0x21e
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: main.main()
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: #011/home/gyuho/go/src/github.com/coreos/etcd/cmd/main.go:28 +0x14
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: WARNING 2017-06-28 11:59:18,481 - Process 450 finished with exit code 2
Jun 28 11:59:18 ip-172-31-131-225 docker/27d00dc9d174[820]: WARNING 2017-06-28 11:59:18,481 - Sleeping 30 seconds before next try...

Ensure we do not create a cluster with no quorum

During initialization, etcd.py registers the member with the etcd cluster.
It can however happen that etcd does not start up correctly after registration.

If this happens multiple times, we can get into a situation where the cluster has no quorum.

Example:

A 3-node cluster with 3 new nodes added:

Members: 6, required quorum: 4, but only the 3 original nodes are correct, so the cluster cannot reach quorum.

Perhaps we should implement strict reconfig check:

https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#strict-reconfiguration-check-mode--strict-reconfig-check

Seems to be fixed in etcd-io/etcd@6974fc6
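
Until then, a hedged sketch of how the wrapper could enable the check when launching the daemon; -strict-reconfig-check is a real etcd flag, but wrapping it in a helper like this is only an illustration:

    import subprocess

    def start_etcd(member_args):
        """Start etcd with strict reconfig checking, so a member-add that
        would leave the cluster without quorum is rejected up front."""
        return subprocess.Popen(['/bin/etcd', '-strict-reconfig-check'] + member_args)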

More than one etcd cluster per AWS account

More a question than an issue, at least for now...
Is my understanding correct that you can't have more than 1 etcd cluster per AWS account by default?

I'm looking at the code here and it seems that the Route53 entries are effectively hardcoded in etcd.py, which means that every deployment of a new cluster will overwrite the entries of old ones. They are parametrised, but the parameters are the stack version and the hosted zone, not the stack name:

    record_name = '_etcd-server._tcp.{}.{}'.format(stack_version, self.hosted_zone)
    (...)
    record_name = '_etcd._tcp.{}.{}'.format(stack_version, self.hosted_zone)
    (...)
    self.update_record(conn, zone_id, 'A', 'etcd-server.{}.{}'.format(stack_version, self.hosted_zone), new_record)

Is my understanding correct? Is that intended?
What if I want to have 2-3 completely separate stacks per AWS account (e.g. dev, staging, prod)?
The workaround seems to be using non-clashing versions even for different stack names, but that seems error-prone...

etcd-cluster produces violations

Hi,

We're currently using stups-etcd-cluster and unfortunately it produces violations in our AWS account, e.g. MISSING_SPEC_LINKS:

MISSING_SPEC_LINKS   application_id: etcd-cluster, invalid_commits: aadf124708e14ab7e34fae2dca0f719d45a87721,     (Upgrade docker base image to ...), repository: 'https://github.com/zalando/stups-etcd-cluster.git' 

Do you have an idea how to fix the violation? If you can fix the issue, please do.

Best,
Yuthasak

Support for individual A records (etcd-%d.example.org)

Currently there is support for service discovery through SRV records. This somewhat works with the etcd2 proxy in Container Linux, but seems to be phased out with etcd3.
There is support for it in the etcd3 gateway/proxy, but there have been no updates for a long time and AFAIK it doesn't even work right now, as it generates a list of endpoints in the wrong format: [endpoint1.:2379,endpoint2.:2379].
It's an easy fix, but it shows that no one is really using it.

A more common setup is to pass the application or etcd3 gateway a list of endpoints. So I'm wondering if this could be supported in the appliance?

Essentially it should just create individual A records:

etcd-0.example.org   10.0.0.1
etcd-1.example.org   10.0.0.2
etcd-2.example.org   10.0.0.3

Which could then be configured in the etcd3 gateway:

etcd gateway start \
  --listen-addr=127.0.0.1:2379 \
  --endpoints=etcd-0.example.org:2379,etcd-1.example.org:2379,etcd-2.example.org:2379

Is it as simple as just creating these records, or is there more to it?

/cc @CyberDem0n

Remove DEBUG logging in Python!

The etcd.py script uses DEBUG logging, which prints all AWS requests including headers etc. (and temporary signatures/keys).

Please fix this (you may want to keep DEBUG logging only for the etcd module itself).
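
A minimal sketch of that suggestion using the standard logging module; the logger name 'etcd' is an assumption about how etcd.py names its loggers:

    import logging

    # keep the root logger (and thus boto/requests, which would otherwise
    # log signed AWS requests) at INFO ...
    logging.basicConfig(level=logging.INFO)

    # ... and enable DEBUG only for the appliance's own module (assumed name)
    logging.getLogger('etcd').setLevel(logging.DEBUG)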
