kubereboot / kured
Kubernetes Reboot Daemon
Home Page: https://kured.dev
License: Apache License 2.0
Kured seems to have been built with the assumption that all nodes are trustworthy.
This assumption can be broken in a multitenant environment, or in an environment where a single node gets compromised.
From a quick look through the code, there is some low-hanging fruit that could be addressed to tighten up security:
The lock is implemented by tweaking the kured daemonset. But this means kured has enough permissions to edit any daemonset in the entire cluster and replace its pods with a privileged container of the attacker's choice. This could be fixed in two ways I can think of. One is to use a different object for the lock, such as a configmap. The other is to add a separate service that is in charge of issuing locks, so the nodes themselves can't.
Permissions are granted to the kured service account to drain/cordon/uncordon workloads, but there is no way to restrict which node it is allowed to act on. So rather than using the account to drain the current node, an attacker could drain every node except the current one, attracting workloads along with their secrets onto the compromised node, then read them off the host, which now has access to them. This too can be solved in one of two ways: have a central service perform the draining/uncordoning, which can be pinned to a more trusted node; or, alternatively, use the kubelet's own credentials to drain itself, since the NodeRestriction admission controller should only allow a node credential to act on its own node.
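As an illustration of the first suggestion, here is a minimal sketch of what a dedicated lock object could look like, assuming a ConfigMap named kured-lock in kube-system that carries the existing lock annotation (the object name is hypothetical; kured does not do this today):
apiVersion: v1
kind: ConfigMap
metadata:
  name: kured-lock                 # hypothetical dedicated lock object
  namespace: kube-system
  annotations:
    weave.works/kured-node-lock: '{"nodeID":"manual"}'
# RBAC for locking could then be limited to get/update on this single
# ConfigMap instead of update on the kured DaemonSet itself.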
Kured kicked in rightly, and disabled scheduling for my node:
app03.lan.davidkarlsen.com Ready,SchedulingDisabled <none> 62d v1.11.2
I can see a number of pods being killed:
root@app03:/var/log/containers# tail kured-vkwnk_kube-system_kured-83287b3a6ba5d8a4dfd8a22822932a1655b71cc2ca2bfbd5007f5d389992100c.log
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"kube-system-kubernetes-dashboard-proxy-55c7756d46-dsqzq\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.066034203Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"coredns-78fcdf6894-k87gl\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.066110928Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"monitoring-grafana-788f47b84-bkggz\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.066134368Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"monitoring-prometheus-alertmanager-cbcc46d55-gwkqz\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.179296096Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"logging-cerebro-6794fc6bc6-t26v9\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.179378188Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"monocular-monocular-mongodb-5644f785b9-24tmz\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.202395762Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"monitoring-prometheus-blackbox-exporter-7775df5698-86s67\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.202444091Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"monitoring-prometheus-server-75bfb9f66-xm9vp\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.249172642Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"kube-ops-view-kube-ops-view-kube-ops-view-6db67848c4-krmx8\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.449382938Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"logging-elasticsearch-client-5978d8f465-t9kkm\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.649729383Z"}
root@app03:/var/log/containers#
But then nothing more happens. I guess it fails at something, but the logs should say why.
These pods are left (mainly daemons, except for the nginx-ingress):
Non-terminated Pods: (9 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
auditbeat auditbeat-auditbeat-q4k65 0 (0%) 0 (0%) 0 (0%) 0 (0%)
datadog datadog-datadog-agent-datadog-pr9w5 200m (2%) 200m (2%) 256Mi (1%) 256Mi (1%)
kube-system calico-node-qlmcc 250m (3%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-proxy-5cf42 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-system-nginx-ingress-controller-84f76b76cb-jp8dr 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-system-nginx-ingress-default-backend-6b557bb97c-vlfqc 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kured-vkwnk 0 (0%) 0 (0%) 0 (0%) 0 (0%)
logging fluent-bit-2djgl 100m (1%) 0 (0%) 100Mi (0%) 100Mi (0%)
monitoring monitoring-prometheus-node-exporter-6jl6k 0 (0%) 0 (0%) 0 (0%) 0 (0%)
any hints?
Hi folks,
Apologies if this seems like a daft question. I noticed that kured, as deployed in our clusters, is using quite a large amount of memory at first glance, as shown by kubectl top pods:
NAME CPU(cores) MEMORY(bytes)
kured-57mc6 0m 489Mi
kured-7rdms 0m 547Mi
kured-bpcz8 0m 431Mi
kured-jzkpr 0m 385Mi
kured-kf2dj 0m 302Mi
kured-sm8r6 0m 472Mi
Is kured supposed to be using this much memory, and if so, what is likely to contribute to this usage? If this memory usage seems high, are there things that could be done to reduce it?
Thanks in advance for your help.
Tainting master nodes seems like a widely used way of forbidding pods from being scheduled on master nodes. We want kured pods to be created on master nodes too, though, to manage their updates as well.
We are considering adding the corresponding toleration so users aren't confused when kured doesn't reboot their master nodes (as seen on Slack).
tolerations:
- key: node-role.kubernetes.io/master
effect: NoSchedule
It's probably more useful to be told when a node is being taken out of service than to know that an already-out-of-service node is about to be rebooted.
Are you planning to release new versions or are you recommending everyone uses git master?
Hi! Thanks for kured, very useful. I was wondering, if I have scheduled backups and right during a backup kured detects that a node requires rebooting, will the backups or any other cron jobs complete first or will they be terminated or something? Thanks
Feature request: the ability to set the order of the nodes based on a configured query.
For example, with a StatefulSet pod: draining Prometheus with only 1 replica causes metrics to be lost.
To avoid more than one drain causing metrics loss, the Prometheus node could be set to be the last node upgraded, thereby preventing Prometheus from being moved more than once.
When the first node reboots, the cluster autoscaler (CAS) brings up another node to host the drained pods that are pending scheduling. Once the process of rebooting all nodes is complete, there is this one new node that has not been patched or rebooted. The cycle of rebooting nodes and the CAS bringing up a new, unpatched node continues. Do you have any suggestions on how this should work?
running on master-114c349
{"log":"time=\"2018-12-06T21:32:02Z\" level=warning msg=\"nsenter: setns(): can't reassociate to namespace 'mnt': Operation not permitted\" cmd=/usr/bin/nsenter std=err\n","stream":"stderr","time":"2018-12-06T21:32:02.953439101Z"}
{"log":"time=\"2018-12-06T21:33:02Z\" level=warning msg=\"nsenter: setns(): can't reassociate to namespace 'mnt': Operation not permitted\" cmd=/usr/bin/nsenter std=err\n","stream":"stderr","time":"2018-12-06T21:33:02.955788412Z"}
{"log":"time=\"2018-12-06T21:34:02Z\" level=warning msg=\"nsenter: setns(): can't reassociate to namespace 'mnt': Operation not permitted\" cmd=/usr/bin/nsenter std=err\n","stream":"stderr","time":"2018-12-06T21:34:02.960860999Z"}
{"log":"time=\"2018-12-06T21:34:13Z\" level=warning msg=\"nsenter: setns(): can't reassociate to namespace 'mnt': Operation not permitted\" cmd=/usr/bin/nsenter std=err\n","stream":"stderr","time":"2018-12-06T21:34:13.861012393Z"}
{"log":"time=\"2018-12-06T21:34:13Z\" level=info msg=\"Reboot not required\"\n","stream":"stderr","time":"2018-12-06T21:34:13.861419579Z"}
{"log":"time=\"2018-12-06T21:35:02Z\" level=warning msg=\"nsenter: setns(): can't reassociate to namespace 'mnt': Operation not permitted\" cmd=/usr/bin/nsenter std=err\n","stream":"stderr","time":"2018-12-06T21:35:02.963461403Z"}
{"log":"time=\"2018-12-06T21:36:02Z\" level=warning msg=\"nsenter: setns(): can't reassociate to namespace 'mnt': Operation not permitted\" cmd=/usr/bin/nsenter std=err\n","stream":"stderr","time":"2018-12-06T21:36:02.970068445Z"}
{"log":"time=\"2018-12-06T21:37:02Z\" level=warning msg=\"nsenter: setns(): can't reassociate to namespace 'mnt': Operation not permitted\" cmd=/usr/bin/nsenter std=err\n","stream":"stderr","time":"2018-12-06T21:37:02.972730235Z"}
{"log":"time=\"2018-12-06T21:38:02Z\" level=warning msg=\"nsenter: setns(): can't reassociate to namespace 'mnt': Operation not permitted\" cmd=/usr/bin/nsenter std=err\n","stream":"stderr","time":"2018-12-06T21:38:02.978542906Z"}
{"log":"time=\"2018-12-06T21:39:02Z\" level=warning msg=\"nsenter: setns(): can't reassociate to namespace 'mnt': Operation not permitted\" cmd=/usr/bin/nsenter std=err\n","stream":"stderr","time":"2018-12-06T21:39:02.980484477Z"}
Some updates need to happen after the node is drained so that they don't affect workloads (some kinds of Docker upgrades, for example).
So the procedure might be: drain the node, run the update script, then reboot.
Can a hook on the node agent be added to run a script before reboot but after drain?
Thanks very much for this, it's really useful.
There doesn't seem to be any RBAC support yet, and it would be useful. It probably needs to run with its own service account kube-system:kured, and have a role and rolebindings, maybe packaged up as a helm chart?
I'm happy to develop and submit a patch, if you are ok to review it?
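For illustration, a rough sketch of what those RBAC objects might look like, based only on what kured needs to do (cordon/drain nodes and place its lock); the exact resources and verbs here are assumptions, not the project's published manifests:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kured
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kured
rules:
# cordon/uncordon and drain touch nodes, pods and evictions
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "patch"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "delete"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kured
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kured
subjects:
- kind: ServiceAccount
  name: kured
  namespace: kube-system
# A namespaced Role/RoleBinding in kube-system allowing get/update on the
# kured DaemonSet would also be needed for the lock annotation.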
Hi!
For some reason, quay.io doesn't have an image tagged 'latest', even though commit c42fff3 seems to suggest there is a latest tag.
Events of a new deploy (using the helm chart), when pointing to 'latest':
LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
11s 11s 1 kured-01.155212f4960bd900 DaemonSet Normal SuccessfulCreate daemonset-controller Created pod: kured-01-zb8jc
10s 10s 1 kured-01-zb8jc.155212f4d9572677 Pod spec.containers{kured} Normal Pulling kubelet, aks-nodepool1-31356381-1 pulling image "quay.io/weaveworks/kured:latest"
9s 9s 1 kured-01-zb8jc.155212f4ebbc020b Pod spec.containers{kured} Warning Failed kubelet, aks-nodepool1-31356381-1 Failed to pull image "quay.io/weaveworks/kured:latest": rpc error: code = Unknown desc = Tag latest not found in repository quay.io/weaveworks/kured
9s 9s 1 kured-01-zb8jc.155212f4ebbc4a50 Pod spec.containers{kured} Warning Failed kubelet, aks-nodepool1-31356381-1 Error: ErrImagePull
8s 8s 1 kured-01-zb8jc.155212f5368dff63 Pod spec.containers{kured} Normal BackOff kubelet, aks-nodepool1-31356381-1 Back-off pulling image "quay.io/weaveworks/kured:latest"
8s 8s 1 kured-01-zb8jc.155212f5368e2e43 Pod spec.containers{kured} Warning Failed kubelet, aks-nodepool1-31356381-1 Error: ImagePullBackOff
I noticed that the file https://github.com/weaveworks/kured/releases/download/1.1.0/kured-1.1.0.yaml still references quay.io, not Docker Hub, so the installation instructions fail.
Slack incoming webhooks let you override the channel name. It would be good to provide an option to override it, since we can have a single incoming webhook integration but send notifications for different environments to different channels.
We are using kured on AKS and I regularly see that nodes stay in status Ready,SchedulingDisabled and I have to uncordon them manually.
When I look into the log file of the kured pod it shows:
time="2019-03-06T06:30:27Z" level=info msg="Kubernetes Reboot Daemon: 1.1.0"
time="2019-03-06T06:30:27Z" level=info msg="Node ID: aks-default-13951270-0"
time="2019-03-06T06:30:27Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2019-03-06T06:30:27Z" level=info msg="Reboot Sentinel: /var/run/reboot-required every 1h0m0s"
time="2019-03-06T06:30:28Z" level=info msg="Holding lock"
time="2019-03-06T06:30:28Z" level=info msg="Uncordoning node aks-default-13951270-0"
time="2019-03-06T06:30:29Z" level=info msg="node/aks-default-13951270-0 uncordoned" cmd=/usr/bin/kubectl std=out
time="2019-03-06T06:30:29Z" level=info msg="Releasing lock"
So it says it uncordoned it, but still I regularly see that nodes are in fact not uncordoned.
Is this something you guys see more often?
Kured v1.2 fails on my kubeadm-created K8s v1.16.0 cluster. It seems that this issue has been solved by fix #75, which isn't part of a release yet.
The error message I received before compiling kured from the latest source was:
time="2019-09-22T19:45:07Z" level=info msg="Blocking Pod Selectors: []"
time="2019-09-22T19:45:07Z" level=fatal msg="Error testing lock: the server could not find the requested resource"
It would be really nice to get a kured release v1.3 including an updated stable Helm chart. This would hopefully make the stable kured Helm chart work on K8s v1.16 clusters without modifications.
I have not used Kured on older Kubernetes versions (yet).
A flag to set delays between reboots should be added. Just as --period checks for the sentinel file, we should have something like --reboot-delay to specify a time frame between reboots of the nodes that have the sentinel file. A use case I have is a cluster containing large pods that take 30 minutes to load. If I let kured do its job, the nodes are restarted too quickly for the pods to recover.
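A sketch of how the proposed flag might look in the DaemonSet container args; --period exists, but --reboot-delay is only the proposal from this request, not a real kured flag:
command:
- /usr/bin/kured
args:
- --period=1h
- --reboot-delay=30m        # proposed flag: wait 30m after one node reboots before the next may start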
According to https://github.com/weaveworks/kured/blob/d7b9c9fbec26e113d8b90499c3e58bc06098581c/cmd/kured/main.go#L256, kured will only drain the node before rebooting if the node was not already cordoned/unschedulable.
Just because a node is marked unschedulable does not mean all workloads were already drained. The node might have been cordoned manually but not drained before, kured might have started a drain but might have been interrupted (crash, update, whatever) before finishing the drain. Always running drain before reboot makes this safer, and if there are no more workloads to drain, is effectively a no-op.
this is the output I get from kured:
{"log":"time=\"2018-08-07T13:27:21Z\" level=info msg=\"Kubernetes Reboot Daemon: master-5731b98\"\n","stream":"stderr","time":"2018-08-07T13:27:21.023797488Z"}
{"log":"time=\"2018-08-07T13:27:21Z\" level=info msg=\"Node ID: app02.lan.davidkarlsen.com\"\n","stream":"stderr","time":"2018-08-07T13:27:21.02386206Z"}
{"log":"time=\"2018-08-07T13:27:21Z\" level=info msg=\"Lock Annotation: kube-system/kured:weave.works/kured-node-lock\"\n","stream":"stderr","time":"2018-08-07T13:27:21.023876817Z"}
{"log":"time=\"2018-08-07T13:27:21Z\" level=info msg=\"Reboot Sentinel: /var/run/reboot-required every 1h0m0s\"\n","stream":"stderr","time":"2018-08-07T13:27:21.023889431Z"}
{"log":"time=\"2018-08-07T14:26:17Z\" level=info msg=\"Reboot required\"\n","stream":"stderr","time":"2018-08-07T14:26:17.381148603Z"}
{"log":"time=\"2018-08-07T14:26:17Z\" level=warning msg=\"Reboot blocked: 1 active alerts: [deployment_replicas_mismatch]\"\n","stream":"stderr","time":"2018-08-07T14:26:17.439989295Z"}
{"log":"time=\"2018-08-07T15:26:17Z\" level=info msg=\"Reboot required\"\n","stream":"stderr","time":"2018-08-07T15:26:17.381452887Z"}
{"log":"time=\"2018-08-07T15:26:22Z\" level=warning msg=\"Reboot blocked: 1 active alerts: [deployment_replicas_mismatch]\"\n","stream":"stderr","time":"2018-08-07T15:26:22.38692619Z"}
{"log":"time=\"2018-08-07T16:26:17Z\" level=info msg=\"Reboot required\"\n","stream":"stderr","time":"2018-08-07T16:26:17.381447461Z"}
{"log":"time=\"2018-08-07T16:26:17Z\" level=warning msg=\"Reboot blocked: 1 active alerts: [deployment_replicas_mismatch]\"\n","stream":"stderr","time":"2018-08-07T16:26:17.387051589Z"}
{"log":"time=\"2018-08-07T17:26:17Z\" level=info msg=\"Reboot required\"\n","stream":"stderr","time":"2018-08-07T17:26:17.381804597Z"}
{"log":"time=\"2018-08-07T17:26:17Z\" level=warning msg=\"Reboot blocked: 1 active alerts: [deployment_replicas_mismatch]\"\n","stream":"stderr","time":"2018-08-07T17:26:17.387159836Z"}
{"log":"time=\"2018-08-07T18:26:17Z\" level=info msg=\"Reboot required\"\n","stream":"stderr","time":"2018-08-07T18:26:17.381839419Z"}
{"log":"time=\"2018-08-07T18:26:17Z\" level=warning msg=\"Lock already held: app03.lan.davidkarlsen.com\"\n","stream":"stderr","time":"2018-08-07T18:26:17.953898535Z"}
{"log":"time=\"2018-08-07T19:26:17Z\" level=info msg=\"Reboot required\"\n","stream":"stderr","time":"2018-08-07T19:26:17.532210049Z"}
{"log":"time=\"2018-08-07T19:26:17Z\" level=warning msg=\"Reboot blocked: 1 active alerts: [deployment_replicas_mismatch]\"\n","stream":"stderr","time":"2018-08-07T19:26:17.778564028Z"}
{"log":"time=\"2018-08-07T20:26:17Z\" level=info msg=\"Reboot required\"\n","stream":"stderr","time":"2018-08-07T20:26:17.51315201Z"}
{"log":"time=\"2018-08-07T20:26:17Z\" level=warning msg=\"Reboot blocked: 1 active alerts: [deployment_replicas_mismatch]\"\n","stream":"stderr","time":"2018-08-07T20:26:17.950332758Z"}
{"log":"time=\"2018-08-07T21:26:17Z\" level=info msg=\"Reboot required\"\n","stream":"stderr","time":"2018-08-07T21:26:17.502508634Z"}
{"log":"time=\"2018-08-07T21:26:19Z\" level=info msg=\"Acquired reboot lock\"\n","stream":"stderr","time":"2018-08-07T21:26:19.041613032Z"}
{"log":"time=\"2018-08-07T21:26:19Z\" level=info msg=\"Draining node app02.lan.davidkarlsen.com\"\n","stream":"stderr","time":"2018-08-07T21:26:19.041639516Z"}
{"log":"time=\"2018-08-07T21:26:27Z\" level=info msg=\"node \\\"app02.lan.davidkarlsen.com\\\" cordoned\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:27.608428496Z"}
{"log":"time=\"2018-08-07T21:26:29Z\" level=warning msg=\"WARNING: Deleting pods with local storage: anchore-anchore-engine-anchore-engine-worker-55bc984d7-km7zx, flux-584d78b89f-xhrch, minio-minio-799cd646f-ks7qv, monocular-monocular-monocular-api-6747bb55c-httx5; Ignoring DaemonSet-managed pods: auditbeat-auditbeat-ffd58, datadog-datadog-agent-datadog-clqsm, calico-node-4kjvx, kube-proxy-6hmhz, kured-q7jkr, fluent-bit-thql4, monitoring-prometheus-node-exporter-xf4rj\" cmd=/usr/bin/kubectl std=err\n","stream":"stderr","time":"2018-08-07T21:26:29.263488783Z"}
{"log":"time=\"2018-08-07T21:26:30Z\" level=info msg=\"pod \\\"metrics-server-metrics-server-658d69ddf7-hwh7p\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:30.129399317Z"}
{"log":"time=\"2018-08-07T21:26:30Z\" level=info msg=\"pod \\\"minio-minio-799cd646f-ks7qv\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:30.132816317Z"}
{"log":"time=\"2018-08-07T21:26:30Z\" level=info msg=\"pod \\\"anchore-anchore-engine-anchore-engine-worker-55bc984d7-km7zx\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:30.143296645Z"}
{"log":"time=\"2018-08-07T21:26:30Z\" level=info msg=\"pod \\\"flux-584d78b89f-xhrch\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:30.143492166Z"}
{"log":"time=\"2018-08-07T21:26:30Z\" level=info msg=\"pod \\\"logging-elasticsearch-data-1\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:30.143637461Z"}
{"log":"time=\"2018-08-07T21:26:30Z\" level=info msg=\"pod \\\"anchore-anchore-engine-postgresql-5cd6586d5b-mp6d8\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:30.144007446Z"}
{"log":"time=\"2018-08-07T21:26:30Z\" level=info msg=\"pod \\\"logging-elasticsearch-master-1\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:30.144136576Z"}
{"log":"time=\"2018-08-07T21:26:31Z\" level=info msg=\"pod \\\"anchore-anchore-engine-anchore-engine-core-645cd6b7fd-2w4fs\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:31.950339088Z"}
{"log":"time=\"2018-08-07T21:26:31Z\" level=info msg=\"pod \\\"monitoring-prometheus-alertmanager-cbcc46d55-nkxkh\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:31.950422791Z"}
{"log":"time=\"2018-08-07T21:26:34Z\" level=info msg=\"pod \\\"kube-ops-view-kube-ops-view-kube-ops-view-6db67848c4-dfgqg\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:34.623470381Z"}
{"log":"time=\"2018-08-07T21:26:36Z\" level=info msg=\"pod \\\"monocular-monocular-monocular-ui-58f9f95864-qr7fm\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:36.040687921Z"}
{"log":"time=\"2018-08-07T21:26:36Z\" level=info msg=\"pod \\\"flux-memcached-5f8b4c7dc8-l6b5t\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:36.199982135Z"}
{"log":"time=\"2018-08-07T21:26:36Z\" level=info msg=\"pod \\\"tiller-deploy-5c688d5f9b-v2hbx\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:36.200033975Z"}
{"log":"time=\"2018-08-07T21:26:36Z\" level=info msg=\"pod \\\"kube-system-kubernetes-dashboard-proxy-64f7674f88-c97m4\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:36.200053208Z"}
{"log":"time=\"2018-08-07T21:26:36Z\" level=info msg=\"pod \\\"kube-system-heapster-heapster-56c646d674-87ptg\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:36.688737532Z"}
{"log":"time=\"2018-08-07T21:26:36Z\" level=info msg=\"pod \\\"logging-elasticsearch-client-5978d8f465-dr6ft\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:36.688776421Z"}
{"log":"time=\"2018-08-07T21:26:37Z\" level=info msg=\"pod \\\"monocular-monocular-monocular-prerender-6ffdcd79c4-cfs8n\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:37.402532532Z"}
{"log":"time=\"2018-08-07T21:26:37Z\" level=info msg=\"pod \\\"monitoring-prometheus-server-75bfb9f66-wxmq5\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:37.402572959Z"}
{"log":"time=\"2018-08-07T21:26:37Z\" level=info msg=\"pod \\\"monitoring-prometheus-blackbox-exporter-7775df5698-cw49s\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:37.403119108Z"}
{"log":"time=\"2018-08-07T21:26:37Z\" level=info msg=\"pod \\\"hubot-hubot-d7dc4978c-f48nz\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:37.403141351Z"}
{"log":"time=\"2018-08-07T21:26:37Z\" level=info msg=\"pod \\\"monocular-monocular-monocular-api-6747bb55c-httx5\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:37.403189515Z"}
{"log":"time=\"2018-08-07T21:26:37Z\" level=info msg=\"node \\\"app02.lan.davidkarlsen.com\\\" drained\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-07T21:26:37.403202192Z"}
{"log":"time=\"2018-08-07T21:26:37Z\" level=info msg=\"Commanding reboot\"\n","stream":"stderr","time":"2018-08-07T21:26:37.406949235Z"}
{"log":"time=\"2018-08-07T21:26:42Z\" level=warning msg=\"Error notifying slack: Post https://hooks.slack.com/services/obfuscated/obfuscated/obfuscated: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\"\n","stream":"stderr","time":"2018-08-07T21:26:42.407467437Z"}
{"log":"time=\"2018-08-07T21:27:16Z\" level=warning msg=\"Failed to set wall message, ignoring: Connection reset by peer\" cmd=/bin/systemctl std=err\n","stream":"stderr","time":"2018-08-07T21:27:16.609804186Z"}
{"log":"time=\"2018-08-07T21:27:16Z\" level=warning msg=\"Failed to reboot system via logind: Transport endpoint is not connected\" cmd=/bin/systemctl std=err\n","stream":"stderr","time":"2018-08-07T21:27:16.609844005Z"}
I have configured the flag --blocking-pod-selector=runtime=long to prevent kured from rebooting a node while certain pods are running (according to the documentation). Now the kured pods are in CrashLoopBackOff status and the logs are:
Error: unknown flag: --blocking-pod-selector
Usage:
kured [flags]
Flags:
--alert-filter-regexp regexp.Regexp alert names to ignore when checking for active alerts
--ds-name string name of daemonset on which to place lock (default "kured")
--ds-namespace string namespace containing daemonset on which to place lock (default "kube-system")
-h, --help help for kured
--lock-annotation string annotation in which to record locking node (default "weave.works/kured-node-lock")
--period duration reboot check period (default 1h0m0s)
--prometheus-url string Prometheus instance to probe for active alerts
--reboot-sentinel string path to file whose existence signals need to reboot (default "/var/run/reboot-required")
--slack-hook-url string slack hook URL for reboot notfications
--slack-username string slack username for reboot notfications (default "kured")
time="2019-03-07T16:57:24Z" level=fatal msg="unknown flag: --blocking-pod-selector"
This suggests the flag isn't supported. I'm using version 1.1.0 of kured.
The cordon and drain process is initiated but the reboot command fails. This then kills the container, which causes every node to be cordoned and drained every hour without the reboot ever occurring.
Infrastructure
Azure AKS
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:08:19Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"
Pod
Name: kured-9hwl9
Namespace: core
Priority: 0
PriorityClassName: <none>
Node: 0/10.4.0.6
Start Time: Wed, 07 Nov 2018 16:24:18 +0000
Labels: app=kured
controller-revision-hash=2895661024
pod-template-generation=1
release=
Annotations: <none>
Status: Running
IP: 10.244.2.127
Controlled By: DaemonSet/kured
Containers:
kured:
Container ID: docker://bfe49a8a161d4f6e349ed30eaaa3894fb5332f690e00df2d6ec30f0bf3f0f25e
Image: quay.io/weaveworks/kured:1.1.0
Image ID: docker-pullable://quay.io/weaveworks/kured@sha256:9cb1aa3ffc06bd97c3a449eb69790fbda763c9c88195e293a03a3adfbfe4b512
Port: <none>
Command:
/usr/bin/kured
Args:
--ds-name=kured
--ds-namespace=core
State: Running
Started: Thu, 08 Nov 2018 14:11:10 +0000
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Thu, 08 Nov 2018 13:39:10 +0000
Finished: Thu, 08 Nov 2018 14:11:05 +0000
Ready: True
Restart Count: 22
Environment:
KURED_NODE_ID: (v1:spec.nodeName)
Mounts:
/var/run from hostrun (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-j9s5f (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
hostrun:
Type: HostPath (bare host directory volume)
Path: /var/run
HostPathType:
default-token-j9s5f:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-j9s5f
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events:
Type Reason Age From Message
Normal Pulled 40m (x23 over 22h) kubelet, aks-qube-27091152-0 Container image "quay.io/weaveworks/kured:1.1.0" already present on machine
Normal Created 40m (x23 over 22h) kubelet, aks-qube-27091152-0 Created container
Normal Started 40m (x23 over 22h) kubelet, aks-qube-27091152-0 Started container
Logs
time="2018-11-08T14:11:05Z" level=info msg="Commanding reboot"
time="2018-11-08T14:11:05Z" level=warning msg="nsenter: can't execute '/bin/systemctl': No such file or directory" cmd=/usr/bin/nsenter std=err
time="2018-11-08T14:11:05Z" level=fatal msg="Error invoking reboot command: exit status 127"
Hi,
I've just opened a PR - helm/charts#6470 - to add a helm chart for kured, with a view to making kured easy to helm-install.
I thought it'd be sensible for me to create an issue here for two reasons:
I have noticed a bunch of pull requests fixing multiple issues that have not been merged into the project for a while. Is it possible to have better repo maintenance? Thanks!
Hello there,
As a possible feature, can you add an option so that reboots can be constrained to a specific time window?
This way I would be able to have the reboots, let's say, happening only at night time.
Thank you
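A sketch of what such a time-window option could look like as DaemonSet args; the flag names below are purely illustrative and did not exist in kured at the time of this request:
args:
- --period=1h
- --reboot-window-start=22:00   # hypothetical: earliest time a reboot may begin
- --reboot-window-end=06:00     # hypothetical: latest time a reboot may begin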
kubectl drain should not, in theory, kill the kured pod, as it's a DaemonSet; we rely on this behaviour, because after the drain operation is complete we need to command the reboot. We have, however, experienced the kured pod being killed during drain when the embedded version of kubectl is too different to the server (specifically, with kubectl 1.7.x against server 1.9.x), resulting in a never-ending cycle of lock/drain/restart/unlock without the reboot actually occurring.
Possible fixes: a terminationGracePeriodSeconds long enough that we can complete the reboot (problem: how long is long enough?).
In case of multiple nodes asking to restart, pods will be moved more often than needed, because they can start on a node that will be rebooted next.
Using a PreferNoSchedule taint on nodes waiting to be rebooted would prefer scheduling pods onto already-rebooted nodes.
link: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
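For illustration, the taint kured could place on a node while it waits for the reboot lock might look like this on the Node spec (the taint key is an assumption made up for this sketch):
spec:
  taints:
  - key: weave.works/kured-reboot-pending   # hypothetical key applied while the node awaits its reboot
    effect: PreferNoSchedule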
Feature request: Please provide images for other architectures (more than amd64) as part of your release.
kured needs to run on every node, and for my cluster that means amd64, arm, and arm64. I can build those myself, but it would be better if other similar users could also benefit from some centralised effort to support other archs. Thanks :)
Just a quick note to let you know that I created a daemon to reboot nodes after some uptime.
https://github.com/barpilot/kured-toujours
It's really simple and alpha for now, but it can be useful for some people.
It's mainly used to periodically reboot every node and avoid some deadlocks with Docker, mounts, etc., with the power of kured.
The current locking approach requires the kured job to have the RBAC privilege to update its own DaemonSet. This means a compromised kured job could redefine itself to (e.g.) further elevate host access (*).
Better (from an RBAC pov) would be to use a "harmless" resource type to store the lock annotation. A ConfigMap (empty/dedicated for this purpose) would be ideal.
(*) I acknowledge that kured also needs other scary privileges (patch nodes) in order to conduct a drain/uncordon, and these won't be improved by this suggestion. Small steps.
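If the lock moved to a dedicated ConfigMap, the lock-related RBAC could shrink to something like the following sketch (the ConfigMap name is an assumption):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kured-lock
  namespace: kube-system
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["kured-lock"]   # hypothetical dedicated lock object
  verbs: ["get", "update"]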
Currently, for a cluster running Istio, node draining can be blocked by Kubernetes due to the PodDisruptionBudget policy configuration. When this occurs, kured is blocked until the lock times out.
Kured should ignore this error and reboot the node anyway.
time="2019-07-24T14:28:47Z" level=warning msg="error when evicting pod \"istio-galley-75466f5dc7-cz7sd\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget." cmd=/usr/bin/kubectl std=err time="2019-07-24T14:28:52Z" level=warning msg="error when evicting pod \"istio-telemetry-66fbcd998b-ts27t\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget." cmd=/usr/bin/kubectl std=err time="2019-07-24T14:28:52Z" level=warning msg="error when evicting pod \"istio-sidecar-injector-6f4c67c6cd-62clr\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget." cmd=/usr/bin/kubectl std=err
It may be operationally interesting to have a finer-grained lock than the global lock on the DaemonSet, in case a specific node needs to stay up for one reason or another.
Since this change in Docker, the use of docker login needs to be changed from
docker login -u $USER -p $PASS
to
echo $PASS | docker login -u $USER --password-stdin
(Since TravisCI attaches a terminal it will block forever. @stefanprodan brought this up.)
I know there is integration with Prometheus to hold off reboots while alerts are firing. Is there any way to get the same functionality with Sysdig?
This morning I'm getting this error when trying to pull the kured docker image.
sudo docker pull quay.io/weaveworks/kured:master-5731b98
Error response from daemon: Get https://quay.io/v2/weaveworks/kured/manifests/master-5731b98: unknown: Namespace weaveworks has been disabled. Please contact a system administrator.
See https://quay.io/repository/weaveworks/kured/manifest/sha256:bee4241779a29a7f7f6a8e5de7a8a5d4042317236ab8ee6d60d50832ad9e55ed?tab=vulnerabilities for quay.io's automated assessment of the most recent public build (oddly-tagged, but someone has already posted that issue: #33)
Might some of these vulnerabilities be helped by using newer versions of the dependencies listed in https://github.com/weaveworks/kured/blob/master/Gopkg.toml ?
The use of a file such as /var/run/reboot-required is specific to Ubuntu-family distros; such a flag does not get set for RHEL/CentOS distros using Yum/RPM for package management. In those environments, the yum-utils package is installed, and then the needs-restarting -r command can be run to detect the need for a reboot: its exit code is 1 when a reboot is needed.
I suggest adding an optional argument to kured that provides a command to be executed. If the command exists and its exit code is non-zero, then trigger a reboot the same as if the sentinel file existed. The default value for the option, if provided, would be needs-restarting -r.
Reference: How can I check from the command line if a reboot is required on rhel or centos
An alternative approach would be to include a short script along the lines of the following that runs when installed on a RHEL/CentOS system (grouped so the sentinel is only touched when needs-restarting is present and reports a reboot is needed):
[[ -x /bin/needs-restarting ]] && { needs-restarting -r >/dev/null || touch /var/run/reboot-required; }
Hello,
I was wondering if there is a way to specify Prometheus SSL certificates along with prometheus-url. Right now, kured is encountering the following error when trying to contact Prometheus:
kured-2wksk kured time="2018-12-27T09:04:02Z" level=warning msg="Reboot blocked: prometheus query error: Get https://monitoring-prometheus:9090/api/v1/query?query=ALERTS&time=2018-12-27T09%3A04%3A02.438586828Z: x509: certificate signed by unknown authority"
It would be great to be able to define a timeout for pod eviction.
If drain did not succeed within the timeout, kured would stop trying to evict pods and release the lock.
This would be handy when pod disruption budgets do not allow eviction.
Regards.
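A sketch of how such a timeout could be exposed as a DaemonSet arg; the flag name below is illustrative only:
args:
- --period=1h
- --drain-timeout=10m   # proposed: stop evicting pods and release the lock after 10 minutes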
In our project we stop our VMs overnight (9 PM to 7 AM) to save money. This configuration applies to the VMs of our AKS clusters as well.
Last Friday at 4/5 AM we got a message in our Slack channel that kured had rebooted the cluster nodes. That overlaps with the time frame when the nodes shouldn't be up. I checked in the Azure Portal and the VMs had been stopped successfully the day before. They were down when the reboot happened.
Does anyone have any idea what might have happened?
AKS-Version: 1.13.5
Helm Chart Version: 1.3.1
Kured version: 1.2.0
I'm using kured on Azure with an ACS Engine generated cluster, and I can see that nodes are being drained and refilled, but it looks like they are not being rebooted.
For example, a reboot-required file was set at 23:43 on April 13th for node k8s-agents-27478824-4:
$ ls -al
...
-rw-r--r-- 1 root root 0 Apr 13 23:43 reboot-required
...
And I see Kured triggering: draining and refilling nodes with pods:
$ kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
cassandra cassandra-cassandra-0 1/1 Running 1 3d 10.30.0.37 k8s-agents-27478824-3
cassandra cassandra-cassandra-1 0/1 Pending 0 6s
...
Sadly, this seems to happen EVERY hour without fail. Digging into this it looks like this is because the nodes are actually not being rebooted:
$ last reboot
reboot system boot 4.13.0-1011-azur Fri Apr 13 23:17 still running
reboot system boot 4.13.0-1011-azur Sun Apr 8 19:21 still running
(Note that the last reboot time is before the timestamp of the reboot-required)
Is there something I need to do with Kured in order to tell it how to reboot nodes etc.? Or is this a bug?
Running K8s 1.11.2 on Azure Kubernetes Service and getting the errors below with RBAC applied (across all pods in the daemonset), using the YAML manifests from the master branch of the kured repo.
time="2018-10-19T13:13:20Z" level=info msg="Kubernetes Reboot Daemon: master-b86c60f"
time="2018-10-19T13:13:20Z" level=info msg="Node ID: aks-node"
time="2018-10-19T13:13:20Z" level=info msg="Lock Annotation: kured/kube-system:weave.works/kured-node-lock"
time="2018-10-19T13:13:20Z" level=info msg="Reboot Sentinel: /var/run/reboot-required every 24h0m0s"
time="2018-10-19T13:13:20Z" level=fatal msg="Error testing lock: daemonsets.extensions \"kube-system\" not found"
kubectl get clusterrole kured
NAME AGE
kured 37m
kubectl get clusterrolebinding kured
NAME AGE
kured 38m
kubectl get role kured -n kube-system
NAME AGE
kured 41m
kubectl get rolebinding kured -n kube-system
NAME AGE
kured 42m
kured is now referenced via https://docs.microsoft.com/en-us/azure/aks/concepts-security#node-security, so it would be good to get it working on AKS :)
An interesting feature for me would be the ability to shut down the node instead of rebooting it. Running on AWS, I would like to let the ASG replace my instance once it has been shut down.
Hi,
I would like to use Microsoft Teams instead of Slack for drain/reboot notifications. The implementation would be quite similar, as it also uses a simple webhook with a custom JSON structure.
Would you accept a PR for that?
thanks
Just in case someone tries the route of running kured on RancherOS and gets these errors:
time="2019-03-15T14:42:22Z" level=warning msg="nsenter: can't execute '/usr/bin/test': No such file or directory" cmd=/usr/bin/nsenter std=err
time="2019-03-15T14:42:52Z" level=warning msg="nsenter: can't execute '/usr/bin/test': No such file or directory" cmd=/usr/bin/nsenter std=err
time="2019-03-15T14:42:52Z" level=info msg="Reboot not required"
time="2019-03-15T14:43:22Z" level=warning msg="nsenter: can't execute '/usr/bin/test': No such file or directory" cmd=/usr/bin/nsenter std=err
time="2019-03-15T14:43:52Z" level=warning msg="nsenter: can't execute '/usr/bin/test': No such file or directory" cmd=/usr/bin/nsenter std=err
time="2019-03-15T14:43:52Z" level=info msg="Reboot not required"
time="2019-03-15T14:44:22Z" level=warning msg="nsenter: can't execute '/usr/bin/test': No such file or directory" cmd=/usr/bin/nsenter std=err
This is because RancherOS has system-dockerd as its PID 1, which does not have much inside and does not have the sentinel file (it would be in the os-console system container):
1 root system-dockerd --restart=false --log-opt max-file=2 --log-opt max-size=25m --pidfile /var/run/system-docker.pid --userland-proxy=false --bip 172.18.42.1/16 --config-file /etc/docker/system-docker.json --exec-root /var/run/system-docker --group root --host unix:///var/run/system-docker.sock --graph /var/lib/system-docker --storage-driver overlay2
It would be nice if kured supported an annotation on the node to have it rebooted (I know there is one to prevent a reboot, but none to force one; only the sentinel file).
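For illustration, a node carrying such a force-reboot annotation might look like this (the annotation key is hypothetical, mirroring the existing weave.works/kured-node-lock naming):
apiVersion: v1
kind: Node
metadata:
  name: app03.lan.davidkarlsen.com
  annotations:
    weave.works/kured-force-reboot: "true"   # hypothetical: treated like the sentinel file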
There are times when we want to reboot nodes but an alert is blocking the reboot, and we don't want to have to restart kured just to add that alert to the filter regex. If alerts are fetched via Prometheus Alertmanager, the alert can be silenced in Alertmanager and the reboot will occur.
Pull request #42 adds this feature
Let me know what else is needed to accept this PR
Thanks
Proposal: allow passing an Icinga URL and host/service for alerts. Kured would query the Icinga API, and if there is an active alert on the specified host/service, the node will not be rebooted.
Useful for users (like us) who use Icinga/Nagios for monitoring.
The Docker image tags available at https://quay.io/repository/weaveworks/kured?tab=tags are challenging: based on the tag name or information given, I can hardly identify what version is running. This is especially difficult when upgrading a Kubernetes cluster to a new version; in that case I would like to update kured to a newer version as well, preferably one matching the respective kubectl version. Versioned Docker tags would help identify the right version of kured to deploy.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T09:14:02Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.5", GitCommit:"cce11c6a185279d037023e02ac5249e14daa22bf", GitTreeState:"clean", BuildDate:"2017-12-07T16:05:18Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
$ kubectl -n kube-system annotate ds kured weave.works/kured-node-lock='{"nodeID":"manual"}'
The DaemonSet "kured" is invalid: metadata.annotations: Invalid value: "weave.works~1kured-node-lock": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName', or 'my.name', or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')
This would facilitate a fail-safe RebootRequired alert that would warn us if the reboot daemon is failing.
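A minimal sketch of such an alerting rule, assuming a kured_reboot_required gauge that is 1 while a node's sentinel file is present (the metric name is an assumption in this sketch):
groups:
- name: kured
  rules:
  - alert: RebootRequired
    expr: max(kured_reboot_required) != 0
    for: 24h
    labels:
      severity: warning
    annotations:
      summary: A node has needed a reboot for over 24 hours; the reboot daemon may be failing.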