cybozu-go / cke
Cybozu Kubernetes Engine
License: Apache License 2.0
Currently we are using CoreOS Container Linux for the Node OS.
CoreOS employs SELinux, but its docker containers run with kernel_t.
Under Red Hat, Fedora, or CentOS, docker containers are labeled with a more restrictive type.
Running mtest with a CentOS Node would improve our confidence in CKE's SELinux support.
Some tools in tests/ do not have tests.
For example, rivers has no tests.
Without tests, we cannot update them reliably.
Describe how to address the issue.
Upgrade Kubernetes version to 1.24.
From the changelog:

Upgrade Kubernetes version to 1.23.
- The kube-scheduler's --port option is deprecated. Delete it.
  https://github.com/cybozu-go/cke/blob/release-1.21/op/k8s/scheduler_boot.go#L111
- The kubelet's --register-with-taints option is deprecated. Update the following document.
  https://github.com/cybozu-go/cke/blob/release-1.21/docs/cluster.md#boot-taints
- kubescheduler.config.k8s.io/v1beta1 will be deleted. Use v1beta3.
  https://github.com/cybozu-go/cke/blob/release-1.21/mtest/cke-cluster.yml#L27
  https://github.com/cybozu-go/cke/blob/release-1.21/testdata/cluster.yaml#L43
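As a sketch of the scheduler-config migration, the required change is the apiVersion of the KubeSchedulerConfiguration manifest; the fields below are illustrative, not CKE's actual settings:

```yaml
# Before (v1beta1, removed in later releases):
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
---
# After (v1beta3):
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
```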
Describe the bug
waitEtcdSyncCommand in AddMemberOp can fail due to a timeout or some other reason.
Once waitEtcdSyncCommand fails, VolumeCreateCommand will be skipped and the etcd-added-member docker volume will never be created.
Environments
To Reproduce
Run docker volume remove etcd-added-member. The etcd-added-member volume is left missing.
Expected behavior
etcd-added-member is (re-)created after waiting for sync.
CKE updates static resources only if the cke.cybozu.com/revision annotation is updated, so the revision needs to be bumped manually whenever a resource is edited.
This is error-prone; in fact, a failure caused by a missing revision update has already occurred.
Detect differences in the YAML files using equality.DeepDerivative, and if there are differences, apply them with server-side apply.
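The apimachinery helper in question is equality.Semantic.DeepDerivative. As a self-contained illustration of its semantics — fields left unset in the desired object are ignored, so server-added fields do not trigger a diff — here is a hand-rolled sketch over generic maps; deepDerivative below is a hypothetical stand-in, not the apimachinery implementation:

```go
package main

import "fmt"

// deepDerivative reports whether every field set in desired matches actual.
// Fields that are nil/absent in desired are ignored -- a sketch of the
// semantics of apimachinery's equality.Semantic.DeepDerivative.
func deepDerivative(desired, actual map[string]any) bool {
	for k, dv := range desired {
		if dv == nil {
			continue // unset in desired: ignored
		}
		av, ok := actual[k]
		if !ok {
			return false
		}
		dm, dIsMap := dv.(map[string]any)
		am, aIsMap := av.(map[string]any)
		if dIsMap || aIsMap {
			if !dIsMap || !aIsMap {
				return false // one side is a map, the other is not
			}
			if !deepDerivative(dm, am) {
				return false
			}
			continue
		}
		if dv != av {
			return false
		}
	}
	return true
}

func main() {
	desired := map[string]any{"spec": map[string]any{"replicas": 2}}
	actual := map[string]any{
		"metadata": map[string]any{"resourceVersion": "123"}, // server-added fields are ignored
		"spec":     map[string]any{"replicas": 2},
	}
	fmt.Println(deepDerivative(desired, actual)) // true: desired is a subset of actual
	actual["spec"].(map[string]any)["replicas"] = 3
	fmt.Println(deepDerivative(desired, actual)) // false: replicas differ, so apply
}
```

When this check reports a difference, the resource would be sent to the API server with server-side apply.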
Make CKE reboot nodes smoothly.
CKE connects to nodes at the beginning of the reconcile loop even if the nodes are rebooting, so:
I want to collect standard and runtime metrics.
Add prometheus.DefaultGatherer to the promhttp.HandlerFor call.
ref:
Line 71 in 9af11fa
Currently, CKE does not use ComponentConfig for kube-proxy whereas almost all command-line flags of kube-proxy have been deprecated.
In fact, kube-proxy issues a warning:
WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
We need to replace command-line args with ComponentConfig for kube-proxy.
https://pkg.go.dev/k8s.io/[email protected]/config/v1alpha1#KubeProxyConfiguration
For reference, EKS has been using ComponentConfig for a while.
aws/containers-roadmap#143
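A minimal sketch of what the replacement could look like, assuming kube-proxy is started with --config pointing at a file like the one below; the field values are illustrative, not CKE's actual settings:

```yaml
# kube-proxy started as: kube-proxy --config=/etc/kube-proxy/config.yml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
clusterCIDR: "10.64.0.0/14"
mode: "iptables"
hostnameOverride: "node1"
```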
Describe how to address the issue.
The "release" workflow uses ghr to manipulate GitHub Releases.
We can now use GitHub's official command-line tool gh, which is provided by the Ubuntu virtual environment.
Describe how to address the issue.
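A hedged sketch of the replacement in a workflow step; the job and step names, and the artifact paths, are illustrative rather than CKE's actual workflow:

```yaml
# Sketch: create a GitHub Release with the preinstalled gh CLI instead of ghr.
release:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v3
    - name: Create release
      run: gh release create "${GITHUB_REF_NAME}" ./cke ./ckecli --title "Release ${GITHUB_REF_NAME}"
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```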
Add a way to add nodes with specific sabakan labels to the cluster.
maybe related issue: #509
Describe how to address the issue.
Currently, kubelet and kube-proxy start after kube-controller-manager and kube-scheduler.
This could decrease the chances for Pods to be distributed among a large number of Nodes.
Reorder the operations so that Node resources get created before kube-scheduler and kube-controller-manager.
Describe how to address the issue.
Try slok/noglog to resolve this problem:
$ cke -h
Usage of cke:
-alsologtostderr
log to standard error as well as files
-config string
configuration file path (default "/etc/cke/config.yml")
-http string
<Listen IP>:<Port number> (default "0.0.0.0:10180")
-interval string
check interval (default "10m")
-log_backtrace_at value
when logging hits line file:N, emit a stack trace
-log_dir string
If non-empty, write log files in this directory
-logfile string
Log filename
-logformat string
Log format [plain,logfmt,json]
-loglevel string
Log level [critical,error,warning,info,debug]
-logtostderr
log to standard error instead of files
-session-ttl string
leader session's TTL (default "60s")
-stderrthreshold value
logs at or above this threshold go to stderr
-v value
log level for V logs
-vmodule value
comma-separated list of pattern=N settings for file-filtered logging
The current sonobuoy job takes too long (> 3 hours).
Life is short, so make it faster!
Describe the bug
After applying this release to the staging environment, CKE failed to apply the squid deployment.
This resource is applied with SSA and the failure is caused by the SSA conflict.
The corresponding CKE operation is resource-apply, and the error message is: Apply failed with 1 conflict: conflict with \"before-first-apply\" using apps/v1: .spec.template.spec.containers[name=\"squid\"].image.
Environments
To Reproduce
Steps to reproduce the behavior:
Expected behavior
squid is deployed with no error.
Kubernetes 1.20 deprecates dockershim: kubernetes/kubernetes#94624
CKE should follow the path.
Describe how to address the issue.
ckecli has a subcommand, ckecli sabakan disable, to disable sabakan integration.
This also removes the sabakan URL registered by ckecli sabakan set-url, so it completely terminates the integration.
There are use cases of suspending/resuming sabakan integration, e.g. in the maintenance of sabakan.
Support such use cases.
Add ckecli sabakan enable to restore sabakan integration.
ckecli sabakan disable will not remove the URL; instead it sets a disabled flag in etcd.
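The proposed semantics can be sketched as follows. This is a minimal illustration, not CKE code: an in-memory map stands in for etcd, and the key names and function names are hypothetical.

```go
package main

import "fmt"

// kv stands in for CKE's etcd storage; the keys are hypothetical.
type kv map[string]string

// disableSabakan suspends the integration but keeps the registered URL.
func disableSabakan(s kv) { s["sabakan/disabled"] = "true" }

// enableSabakan resumes the integration without re-running set-url.
func enableSabakan(s kv) { delete(s, "sabakan/disabled") }

// sabakanEnabled is true only when a URL is set and the flag is clear.
func sabakanEnabled(s kv) bool {
	return s["sabakan/url"] != "" && s["sabakan/disabled"] != "true"
}

func main() {
	s := kv{"sabakan/url": "http://localhost:10080"}
	fmt.Println(sabakanEnabled(s)) // true
	disableSabakan(s)              // suspend; URL is retained
	fmt.Println(sabakanEnabled(s), s["sabakan/url"])
	enableSabakan(s) // resume without ckecli sabakan set-url
	fmt.Println(sabakanEnabled(s)) // true again
}
```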
Upgrade Kubernetes version from 1.19 to 1.20.
- node-role.kubernetes.io/control-plane label
- failure-domain.beta.kubernetes.io/zone label

A maintenance operation does not run while a reboot operation is running.
- Run the reboot operation on another goroutine
- cke.cybozu.com/reboot annotation

It would be very helpful if CKE could generate TLS certificates for admission webhooks,
and can embed them in Secret resources. The CA certificate should also be embedded into ValidatingWebhookConfiguration and MutatingWebhookConfiguration.
Things to do:
- Embed the generated certificates in Secret resources.
- Embed the CA certificate into ValidatingWebhookConfiguration and MutatingWebhookConfiguration.
Add a way to add an empty node on which no Pods run. It may be useful for benchmarking purposes.
maybe related issue: #539
Describe how to address the issue.
Describe the bug
The Vault token may be revoked if a CKE instance is down for a long time.
ckecli history showed the following logs:
{
"id": "30906",
"status": "cancelled",
"operation": "etcd-wait-cluster",
"command": {
"name": "wait-etcd-sync",
"target": "https://10.69.1.136:2379,https://10.69.0.14:2379,https://10.69.0.10:2379"
},
"targets": [
"https://10.69.1.136:2379",
"https://10.69.0.14:2379",
"https://10.69.0.10:2379"
],
"info": "",
"error": "Error making API request.\n\nURL: GET https://127.0.0.1:8200/v1/cke/ca-etcd-client/roles/admin\nCode: 403. Errors:\n\n* permission denied",
"start-at": "2021-03-16T05:46:53.049466554Z",
"end-at": "2021-03-16T05:47:11.072880945Z"
}
Environments
To Reproduce
Steps to reproduce the behavior:
Expected behavior
If the vault returns 403, CKE should exit.
For example, reboot only at night.
Note: this is just a workaround for instability due to node/pod restarting.
Describe how to address the issue.
Currently, if the etcd container on a controller node is down, the kube-apiserver on the same node cannot send its status to CKE. Thus CKE does not try to recover the controller node.
Add etcdRiversOps to manage rivers containers on control planes for the etcd endpoint, with a 127.0.0.1 certificate.
Currently, cke-tools is maintained in https://github.com/cybozu/neco-containers .
As cke-tools should be maintained together with CKE, the code should be moved to this repository.
Describe how to address the issue.
Hello,
Sorry to barge in; I usually check what you are doing, as you have an extremely similar infrastructure to ours.
We ran into an issue with MetalLB + Contour (well, mostly MetalLB-related) and I'm wondering if you have faced this before and how you solved it.
There seems to be an issue with kube-proxy blackholing load-balancer external IPs (creating entries for the MetalLB IP with no endpoints) when using externalTrafficPolicy: Local. This causes traffic inside the cluster to fail to route; it works fine when coming from outside, as the anycast will land on a node with an actual pod, so there will be a route.
This is the metallb ticket: metallb/metallb#153
This is (one of) the kubernetes tickets: kubernetes/kubernetes#75262
Just curious if you found this issue in your infrastructure and how you solved it.
Thanks!
Upgrade Kubernetes version to 1.22.
When a resource or container image is changed, CKE restarts the container simultaneously on all nodes to reflect the change.
If a configuration file or container image is corrupted, the container will not work on all nodes, and the impact of failure is significant.
Implement canary release to reduce the impact of failures.
I want to display cke history continuously.
Add a --follow (-f) option to ckecli history.
If this option is specified, show the history in chronological order, and continuously print new entries as they are appended to the history.
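The follow behavior amounts to keeping a cursor into the append-only history and repeatedly printing everything past it. The sketch below illustrates that loop; in the real ckecli the history lives in etcd and would be polled or watched, whereas here a plain slice and the function name follow are hypothetical.

```go
package main

import "fmt"

// follow prints entries from cursor onward, oldest first, and returns the
// advanced cursor. Calling it repeatedly yields `--follow` behavior.
func follow(history []string, cursor int) int {
	for ; cursor < len(history); cursor++ {
		fmt.Println(history[cursor])
	}
	return cursor
}

func main() {
	history := []string{"op1: completed", "op2: completed"}
	cur := follow(history, 0) // prints the existing entries
	history = append(history, "op3: running")
	cur = follow(history, cur) // prints only the new entry
	fmt.Println(cur)           // 3
}
```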
Update Docker Compose to v2. (We use Docker Compose for Sonobuoy.)
https://github.com/cybozu-go/cke/blob/v1.22.9/sonobuoy/Makefile#L4
Describe how to address the issue.
Upgrade Kubernetes version from 1.24 to 1.25.
Sometimes the CI workflow fails randomly. The failing jobs differ from run to run.
TBD
The API version for policy resources, e.g. PodDisruptionBudget, was upgraded from policy/v1beta1 to policy/v1 at K8s 1.21.
The "pod/eviction" subresource, however, accepts "policy/v1" eviction requests only from K8s 1.22.
Update op/reboot.go to use the policy/v1 package.
Describe how to address the issue.
Rebooting worker nodes can cause a disruption of our cloud services.
We should improve the availability of the services in case of server crashes, but the improvement is only half done.
We need time-range restrictions on the processing of the reboot queue so as not to cause a service disruption during business hours.
Describe how to address the issue.
There is a diagram showing the control plane nodes. It is wrong due to the lack of rivers instances between the apiservers and etcds.
Fix this diagram.
ckecli ssh gets stuck waiting for a private key from a fifo with OpenSSH 8.6p1.
OpenSSH 8.6p1 is installed on Flatcar 2905.2.1, and mtest using Flatcar 2905.2.1 will fail because of this.
Fix ckecli ssh to work with OpenSSH 8.6p1.
Upgrade Kubernetes version from 1.20 to 1.21.
- Add DenyServiceExternalIPs to the list of additionally-enabled admission plugins; remove the default plugins from the list

Upgrade Kubernetes version from 1.25 to 1.26.
- kubescheduler.config.k8s.io/v1beta3
When the reboot queue is stuck, normally we do some manual operation and wait for the next reboot.
The waiting time could be shortened if we could reset the backoff of a reboot entry.
Implement ckecli rq reset-backoff to reset the backoff count and time.
You need to become Certified Kubernetes to use the name Cybozu Kubernetes Engine.
Please see https://github.com/cncf/k8s-conformance/ or email [email protected] with questions.
Describe the bug
When creating a k8s cluster for the first time, CKE fails at kubelet-bootstrap for a while.
It succeeds eventually, though.
ckecli history says:
{
"id": "14",
"status": "cancelled",
"operation": "kubelet-bootstrap",
"command": {
"name": "prepare-kubelet-files",
"target": ""
},
"targets": [
"10.69.0.5",
(snip)
],
"error": "Error making API request.\n\nURL: PUT https://127.0.0.1:8200/v1/cke/ca-kubernetes/issue/kubelet\nCode: 400. Errors:\n\n* unknown role: kubelet",
"start-at": "2019-09-02T07:48:29.685644433Z",
"end-at": "2019-09-02T07:48:31.700480225Z"
}
Environments
To Reproduce
Steps to reproduce the behavior:
ckecli constraints set control-plane-count 3
ckecli constraints set minimum-workers 30
ckecli constraints set maximum-workers 56
ckecli sabakan set-url http://localhost:10080
Expected behavior
kubelet-bootstrap succeeds immediately.
Additional context
The Vault log shows that many read operations for "cke/ca-kubernetes/roles/kubelet" were issued simultaneously.
It seems that highly parallel invocations of addRole() in pki.go confused Vault.
Currently CKE does not run the "observe and update" loop while rebooting a node.
This means that it does not remove an unresponsive API server from the default/kubernetes EndpointSlice while rebooting a control plane node.
#467 will solve this fundamentally, but it will be a big job.
Focus on the default/kubernetes EndpointSlice and update it before rebooting a control plane node.
Upgrade Kubernetes version from 1.18 to 1.19.
- ckecli: ckecli cluster set or ckecli sabakan set-template.
- Move the --volume-plugin-dir specification to KubeletConfiguration's VolumePluginDir field.

From the changelog: The Kubelet's --volume-plugin-dir option is now available via the Kubelet config file field VolumePluginDir. (#88480) [SIG Node]
CKE has some usage:
$ g grep volume-plugin
example/cke-cluster.yml: - "--volume-plugin-dir=/var/lib/kubelet/volumeplugins"
example/cke-cluster.yml: - "--flex-volume-plugin-dir=/var/lib/kubelet/volumeplugins"
op/k8s/kubelet_boot.go: "--volume-plugin-dir=/opt/volume/bin",
sonobuoy/cke-cluster.yml.template: - "--volume-plugin-dir=/var/lib/kubelet/volumeplugins"
sonobuoy/cke-cluster.yml.template: - "--flex-volume-plugin-dir=/var/lib/kubelet/volumeplugins"
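A hedged sketch of the migration for one of these usages; the path is taken from the grep output above, and the fragment is illustrative rather than CKE's actual configuration:

```yaml
# Before: kubelet command-line flag
#   --volume-plugin-dir=/var/lib/kubelet/volumeplugins
# After: KubeletConfiguration field
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
volumePluginDir: /var/lib/kubelet/volumeplugins
```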
When restarting node-dns, you may not be able to resolve names.
Set .spec.updateStrategy.rollingUpdate.maxSurge to 1.
Describe the bug
I could not run mtest with the procedure described in mtest README.
Environments
To Reproduce
Follow the instructions in mtest README.
At step 3, ./mssh host1 will fail.
$ make clean
$ make setup
$ make placemat
$ chmod 600 mtest_key
$ ./mssh host1
ssh: connect to host 10.0.0.11 port 22: No route to host
Expected behavior
The README should contain an up-to-date procedure.
Additional context
I could run mtest with make setup placemat bootstrap.
Describe the bug
When etcd stops, CKE hangs because the closing of concurrency.Session is not checked.
This causes a failure of the leader election.
Environments
To Reproduce
Steps to reproduce the behavior:
In Reboot Nodes Gracefully docs https://github.com/cybozu-go/cke/blob/main/docs/reboot.md#description
it says, "CKE checks the existence of Job-managed Pods in the nodes. If there are the Pods on the nodes, CKE gives up on rebooting the nodes."
Why do we need to care about Pods from Job resources? What is the consideration here?
Sorry if it's a basic question, but I'd appreciate it if you could teach me.