Per @bacongobbler's request, pulling this out of #64 because it's different, yo.
TL;DR:
Because of moby/moby#26492, which is symptomatic in coreos/bugs#1554, the deis-registry throws numerous `dial tcp <in-cluster ip>:9000: i/o timeout` errors when trying to store docker images in minio. These may or may not surface in the deis-controller logs, either as `dockerclient` exceptions carrying the same error, or as gunicorn exceptions with a `[CRITICAL] WORKER TIMEOUT` error.
From what I can tell, *any* systemd-based host running docker 1.12.1 in a kubernetes cluster is at risk of suffering from this bug. It only *looks* like an etcd issue.
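A quick way to check whether a node is in the at-risk configuration (a sketch; run on the host itself):

```shell
# Both conditions together put the node at risk: docker 1.12.1, plus an
# active systemd-networkd managing the host's links.
docker version --format '{{.Server.Version}}'   # 1.12.1 is the affected release
systemctl is-active systemd-networkd            # "active" means networkd can race docker
```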
## Top-Down (from helmc + deis -> kubernetes -> Docker + CoreOS)
Because that's the road I took, and you get to come along the scenic route with me.
It should be noted that even when things are not working fully, in many cases kubernetes appears "just fine" until you do a little digging.
All of my testing where I ran into this problem was against the Alpha CoreOS channel:

```shell
export KUBERNETES_PROVIDER=libvirt-coreos && export NUM_NODES=2  # or 3, or 4
./cluster/kube-up.sh
# wait for etcd to settle
helmc install workflow-v2.5.0
# wait for the kubernetes cluster to all be ready
deis pull <private registry>:5000/aergo-server:1.0.0-SNAPSHOT -a aergo-server
```
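Before running the `deis pull`, it's worth confirming that workflow actually settled; a minimal check (namespace per the stock chart):

```shell
# All deis-* pods should be Running before pushing an image; minio,
# registry and controller are the components involved in the failure below.
kubectl --namespace=deis get pods
```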
### helmc + deis
Once your kubernetes cluster is up, `helmc install workflow-v2.5.0` may deploy the manifest successfully, or it may fail with `Error from server: error when creating "STDIN": client: etcd cluster is unavailable or misconfigured`.

Once deis workflow has been deployed, the symptom is the following two errors in the controller logs when pushing a docker image into the registry, both of which seem to stem from an inability to connect to the minio server:
#### dockerclient: `dial tcp ... i/o timeout`
```
INFO [aergo-server]: build aergo-server-11b3c2a created
INFO [aergo-server]: rbellamy deployed 192.168.57.10:5000/aergo-server:1.0.0-SNAPSHOT
INFO Pulling Docker image 192.168.57.10:5000/aergo-server:1.0.0-SNAPSHOT
INFO Tagging Docker image 192.168.57.10:5000/aergo-server:1.0.0-SNAPSHOT as localhost:5555/aergo-server:v2
INFO Pushing Docker image localhost:5555/aergo-server:v2
INFO Pushing Docker image localhost:5555/aergo-server:v2
INFO Pushing Docker image localhost:5555/aergo-server:v2
INFO [aergo-server]: dial tcp 10.11.28.91:9000: i/o timeout
ERROR:root:dial tcp 10.11.28.91:9000: i/o timeout
Traceback (most recent call last):
File "/app/api/models/release.py", line 88, in new
release.publish()
File "/app/api/models/release.py", line 135, in publish
publish_release(source_image, self.image, deis_registry, self.get_registry_auth())
File "/app/registry/dockerclient.py", line 199, in publish_release
return DockerClient().publish_release(source, target, deis_registry, creds)
File "/app/registry/dockerclient.py", line 117, in publish_release
self.push("{}/{}".format(self.registry, name), tag)
File "/usr/local/lib/python3.5/dist-packages/backoff.py", line 286, in retry
ret = target(*args, **kwargs)
File "/app/registry/dockerclient.py", line 135, in push
log_output(stream, 'push', repo, tag)
File "/app/registry/dockerclient.py", line 178, in log_output
stream_error(chunk, operation, repo, tag)
File "/app/registry/dockerclient.py", line 195, in stream_error
raise RegistryException(message)
registry.dockerclient.RegistryException: dial tcp 10.11.28.91:9000: i/o timeout
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/api/models/build.py", line 62, in create
source_version=self.version
File "/app/api/models/release.py", line 95, in new
raise DeisException(str(e)) from e
api.exceptions.DeisException: dial tcp 10.11.28.91:9000: i/o timeout
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/rest_framework/views.py", line 471, in dispatch
response = handler(request, *args, **kwargs)
File "/app/api/views.py", line 181, in create
return super(AppResourceViewSet, self).create(request, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/rest_framework/mixins.py", line 21, in create
self.perform_create(serializer)
File "/app/api/viewsets.py", line 21, in perform_create
self.post_save(obj)
File "/app/api/views.py", line 258, in post_save
self.release = build.create(self.request.user)
File "/app/api/models/build.py", line 71, in create
raise DeisException(str(e)) from e
api.exceptions.DeisException: dial tcp 10.11.28.91:9000: i/o timeout
10.10.2.8 "POST /v2/apps/aergo-server/builds/ HTTP/1.1" 400 51 "Deis Client v2.5.1"
#### gunicorn: `[CRITICAL] WORKER TIMEOUT`
```
INFO [aergo-server]: build aergo-server-c09bb9b created
INFO [aergo-server]: rbellamy deployed 192.168.57.10:5000/aergo-server:1.0.0-SNAPSHOT
INFO Pulling Docker image 192.168.57.10:5000/aergo-server:1.0.0-SNAPSHOT
INFO Tagging Docker image 192.168.57.10:5000/aergo-server:1.0.0-SNAPSHOT as localhost:5555/aergo-server:v4
INFO Pushing Docker image localhost:5555/aergo-server:v4
INFO Pushing Docker image localhost:5555/aergo-server:v4
10.10.2.8 "GET /v2/apps/aergo-server/logs HTTP/1.1" 200 1284 "Deis Client v2.5.1"
INFO Pushing Docker image localhost:5555/aergo-server:v4
[2016-09-21 16:05:50 +0000] [24] [CRITICAL] WORKER TIMEOUT (pid:37)
[2016-09-21 16:05:50 +0000] [37] [WARNING] worker aborted
File "/usr/local/bin/gunicorn", line 11, in <module>
sys.exit(run())
File "/usr/local/lib/python3.5/dist-packages/gunicorn/app/wsgiapp.py", line 74, in run
WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
File "/usr/local/lib/python3.5/dist-packages/gunicorn/app/base.py", line 192, in run
super(Application, self).run()
File "/usr/local/lib/python3.5/dist-packages/gunicorn/app/base.py", line 72, in run
Arbiter(self).run()
File "/usr/local/lib/python3.5/dist-packages/gunicorn/arbiter.py", line 189, in run
self.manage_workers()
File "/usr/local/lib/python3.5/dist-packages/gunicorn/arbiter.py", line 524, in manage_workers
self.spawn_workers()
File "/usr/local/lib/python3.5/dist-packages/gunicorn/arbiter.py", line 590, in spawn_workers
self.spawn_worker()
File "/usr/local/lib/python3.5/dist-packages/gunicorn/arbiter.py", line 557, in spawn_worker
worker.init_process()
File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/base.py", line 132, in init_process
self.run()
File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/sync.py", line 124, in run
self.run_for_one(timeout)
File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/sync.py", line 68, in run_for_one
self.accept(listener)
File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/sync.py", line 30, in accept
self.handle(listener, client, addr)
File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/sync.py", line 135, in handle
self.handle_request(listener, req, client, addr)
File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/sync.py", line 176, in handle_request
respiter = self.wsgi(environ, resp.start_response)
File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/wsgi.py", line 170, in __call__
response = self.get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/base.py", line 124, in get_response
response = self._middleware_chain(request)
File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 39, in inner
response = get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/utils/deprecation.py", line 133, in __call__
response = self.get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 39, in inner
response = get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/utils/deprecation.py", line 133, in __call__
response = self.get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 39, in inner
response = get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/utils/deprecation.py", line 133, in __call__
response = self.get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 39, in inner
response = get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/utils/deprecation.py", line 133, in __call__
response = self.get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 39, in inner
response = get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/utils/deprecation.py", line 133, in __call__
response = self.get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 39, in inner
response = get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/utils/deprecation.py", line 133, in __call__
response = self.get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 39, in inner
response = get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/utils/deprecation.py", line 133, in __call__
response = self.get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 39, in inner
response = get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/utils/deprecation.py", line 133, in __call__
response = self.get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 39, in inner
response = get_response(request)
File "/app/api/middleware.py", line 22, in __call__
response = self.get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 39, in inner
response = get_response(request)
File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/base.py", line 185, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/usr/local/lib/python3.5/dist-packages/django/views/decorators/csrf.py", line 58, in wrapped_view
return view_func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/rest_framework/viewsets.py", line 87, in view
return self.dispatch(request, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/rest_framework/views.py", line 471, in dispatch
response = handler(request, *args, **kwargs)
File "/app/api/views.py", line 181, in create
return super(AppResourceViewSet, self).create(request, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/rest_framework/mixins.py", line 21, in create
self.perform_create(serializer)
File "/app/api/viewsets.py", line 21, in perform_create
self.post_save(obj)
File "/app/api/views.py", line 258, in post_save
self.release = build.create(self.request.user)
File "/app/api/models/build.py", line 62, in create
source_version=self.version
File "/app/api/models/release.py", line 88, in new
release.publish()
File "/app/api/models/release.py", line 135, in publish
publish_release(source_image, self.image, deis_registry, self.get_registry_auth())
File "/app/registry/dockerclient.py", line 199, in publish_release
return DockerClient().publish_release(source, target, deis_registry, creds)
File "/app/registry/dockerclient.py", line 117, in publish_release
self.push("{}/{}".format(self.registry, name), tag)
File "/usr/local/lib/python3.5/dist-packages/backoff.py", line 286, in retry
ret = target(*args, **kwargs)
File "/app/registry/dockerclient.py", line 135, in push
log_output(stream, 'push', repo, tag)
File "/app/registry/dockerclient.py", line 175, in log_output
for chunk in stream:
File "/usr/local/lib/python3.5/dist-packages/docker/client.py", line 245, in _stream_helper
data = reader.read(1)
File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/response.py", line 314, in read
data = self._fp.read(amt)
File "/usr/lib/python3.5/http/client.py", line 448, in read
n = self.readinto(b)
File "/usr/lib/python3.5/http/client.py", line 478, in readinto
return self._readinto_chunked(b)
File "/usr/lib/python3.5/http/client.py", line 573, in _readinto_chunked
chunk_left = self._get_chunk_left()
File "/usr/lib/python3.5/http/client.py", line 541, in _get_chunk_left
chunk_left = self._read_next_chunk_size()
File "/usr/lib/python3.5/http/client.py", line 501, in _read_next_chunk_size
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib/python3.5/socket.py", line 575, in readinto
return self._sock.recv_into(b)
File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/base.py", line 191, in handle_abort
self.cfg.worker_abort(self)
File "/app/deis/gunicorn/config.py", line 36, in worker_abort
traceback.print_stack()
```
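Both traces above come straight out of the controller pod; for reference, roughly (the pod suffix is install-specific):

```shell
# Tail the controller while a `deis pull` is in flight to catch the
# dockerclient exception or the gunicorn WORKER TIMEOUT as it happens.
kubectl --namespace=deis logs -f deis-controller-<pod-suffix>
```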
### kubernetes
When checking the status of the cluster, via either `kubectl cluster-info` or `kubectl get componentstatuses`, it reports either `client: etcd cluster is unavailable or misconfigured` or `HTTP probe failed with statuscode: 503`, frequently shifting back and forth between the two at odd intervals.
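A simple poll makes the flapping easy to see; a sketch (the transcript below is just repeated manual runs, but a loop shows the same behavior):

```shell
# On an affected cluster etcd-0 alternates between Healthy and an
# HTTP 503 probe failure every few seconds.
while true; do
  date
  kubectl get componentstatuses
  sleep 2
done
```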
#### flaky etcd
```
2016-09-18 22:34:53
rbellamy@eanna i ~/kubernetes % kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                   ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-0               Healthy     {"health": "true"}
2016-09-18 22:34:58
rbellamy@eanna i ~/kubernetes % kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                   ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-0               Unhealthy   HTTP probe failed with statuscode: 503
2016-09-18 22:35:00
rbellamy@eanna i ~/kubernetes % kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                   ERROR
scheduler            Healthy     ok
controller-manager   Healthy     ok
etcd-0               Healthy     {"health": "true"}
2016-09-18 22:35:02
rbellamy@eanna i ~/kubernetes % kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                   ERROR
scheduler            Healthy     ok
controller-manager   Healthy     ok
etcd-0               Unhealthy   HTTP probe failed with statuscode: 503
2016-09-18 22:35:04
rbellamy@eanna i ~/kubernetes % kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                   ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-0               Healthy     {"health": "true"}
2016-09-18 22:35:06
rbellamy@eanna i ~/kubernetes % kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                   ERROR
scheduler            Healthy     ok
controller-manager   Healthy     ok
etcd-0               Healthy     {"health": "true"}
2016-09-18 22:35:07
rbellamy@eanna i ~/kubernetes % kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                   ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-0               Unhealthy   HTTP probe failed with statuscode: 503
2016-09-18 22:35:09
rbellamy@eanna i ~/kubernetes % kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                   ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-0               Unhealthy   HTTP probe failed with statuscode: 503
```
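Hitting etcd's health endpoint directly on the master flaps the same way, which is what convinced me the 503 is real and not an apiserver artifact. A sketch, assuming etcd2 is listening on the default client port:

```shell
# This is effectively the same check the componentstatus probe performs;
# a healthy pass returns {"health": "true"}, a flap errors out or 503s.
curl -sS http://127.0.0.1:2379/health
```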
### docker + CoreOS
As mentioned above, kubernetes seems to have problems maintaining a healthy connection to etcd, but as far as I could tell those errors do not surface in kubernetes except via the `cluster-info` or `get componentstatuses` commands. In other words, all pods appear to be functioning.
When you begin investigating the CoreOS VMs that host the kubernetes cluster, there is a plethora of errors, the first being the ubiquitous `etcdserver: failed to send out heartbeat on time` / `etcdserver: server is likely overloaded` (etcd-io/etcd#5154), which might appear to call for adding `heartbeat-interval` and `election-timeout` values to the `etcd2` section of `user_data.yml`. Nope.
All of this appears to belong to coreos/bugs#1554. However, further reading brings you to moby/moby#26492, which, in conjunction with a host OS running `systemd-networkd.service`, creates a race condition that prevents the docker `veth` device from being associated with the in-cluster bridge network.
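On an affected node you can see the result of the race directly; a diagnostic sketch (interface and bridge names assume stock docker defaults):

```shell
# A healthy container veth lists "master docker0"; one that lost the race
# has no master, so in-cluster traffic to that pod (e.g. minio on :9000)
# simply times out.
ip -o link show type veth
# Show which links systemd-networkd thinks it manages; the racing veths
# show up here instead of being left alone for docker to handle.
networkctl list
```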