cybozu-go / cke
Cybozu Kubernetes Engine
License: Apache License 2.0
Currently we are using CoreOS Container Linux for the Node OS.
CoreOS employs SELinux, but its docker containers run with kernel_t.
Under Red Hat, Fedora, or CentOS, docker containers are labeled with a more restrictive type.
Running mtest with a CentOS Node would improve our confidence in CKE's SELinux support.
Some tools in tests/ do not have tests.
For example, rivers has no tests.
Without tests, we cannot update them reliably.
Describe how to address the issue.
Upgrade Kubernetes version to 1.24.
From the changelog:

Upgrade Kubernetes version to 1.23.
- The kube-scheduler's --port option is deprecated. Delete it.
  https://github.com/cybozu-go/cke/blob/release-1.21/op/k8s/scheduler_boot.go#L111
- The kubelet's --register-with-taints option is deprecated. Update the following document.
  https://github.com/cybozu-go/cke/blob/release-1.21/docs/cluster.md#boot-taints
- kubescheduler.config.k8s.io/v1beta1 will be deleted. Use v1beta3.
  https://github.com/cybozu-go/cke/blob/release-1.21/mtest/cke-cluster.yml#L27
  https://github.com/cybozu-go/cke/blob/release-1.21/testdata/cluster.yaml#L43
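As a sketch of the scheduler-config migration, the required change is the apiVersion of the KubeSchedulerConfiguration manifest; the fields below are illustrative, not CKE's actual settings:

```yaml
# Before (v1beta1, removed in later releases):
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
---
# After (v1beta3):
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
```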
Describe the bug
waitEtcdSyncCommand in AddMemberOp can fail due to a timeout or some other reason.
Once waitEtcdSyncCommand fails, VolumeCreateCommand will be skipped and the etcd-added-member docker volume will never be created.
Environments
To Reproduce
Run docker volume remove etcd-added-member. The etcd-added-member volume is left missing.
Expected behavior
etcd-added-member is (re-)created after waiting for sync.
CKE updates static resources only if the cke.cybozu.com/revision annotation is updated, so the revision needs to be bumped manually whenever a resource is edited.
This is error-prone; in fact, a failure caused by a missing revision update has already occurred.
Detect differences in the YAML files using equality.DeepDerivative, and if there are differences, apply them with server-side apply.
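The apimachinery helper in question is equality.Semantic.DeepDerivative. As a self-contained illustration of its semantics — fields left unset in the desired object are ignored, so server-added fields do not trigger a diff — here is a hand-rolled sketch over generic maps; deepDerivative below is a hypothetical stand-in, not the apimachinery implementation:

```go
package main

import "fmt"

// deepDerivative reports whether every field set in desired matches actual.
// Fields that are nil/absent in desired are ignored -- a sketch of the
// semantics of apimachinery's equality.Semantic.DeepDerivative.
func deepDerivative(desired, actual map[string]any) bool {
	for k, dv := range desired {
		if dv == nil {
			continue // unset in desired: ignored
		}
		av, ok := actual[k]
		if !ok {
			return false
		}
		dm, dIsMap := dv.(map[string]any)
		am, aIsMap := av.(map[string]any)
		if dIsMap || aIsMap {
			if !dIsMap || !aIsMap {
				return false // one side is a map, the other is not
			}
			if !deepDerivative(dm, am) {
				return false
			}
			continue
		}
		if dv != av {
			return false
		}
	}
	return true
}

func main() {
	desired := map[string]any{"spec": map[string]any{"replicas": 2}}
	actual := map[string]any{
		"metadata": map[string]any{"resourceVersion": "123"}, // server-added fields are ignored
		"spec":     map[string]any{"replicas": 2},
	}
	fmt.Println(deepDerivative(desired, actual)) // true: desired is a subset of actual
	actual["spec"].(map[string]any)["replicas"] = 3
	fmt.Println(deepDerivative(desired, actual)) // false: replicas differ, so apply
}
```

When this check reports a difference, the resource would be sent to the API server with server-side apply.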
Make CKE reboot nodes smoothly.
CKE connects to nodes at the beginning of the reconcile loop even if the nodes are rebooting, so:
I want to collect standard and runtime metrics.
Add prometheus.DefaultGatherer to the promhttp.HandlerFor call.
ref:
Line 71 in 9af11fa
Currently, CKE does not use ComponentConfig for kube-proxy whereas almost all command-line flags of kube-proxy have been deprecated.
In fact, kube-proxy issues a warning:
WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
We need to replace command-line args with ComponentConfig for kube-proxy.
https://pkg.go.dev/k8s.io/[email protected]/config/v1alpha1#KubeProxyConfiguration
For reference, EKS has been using ComponentConfig for a while.
aws/containers-roadmap#143
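A minimal sketch of what the replacement could look like, assuming kube-proxy is started with --config pointing at a file like the one below; the field values are illustrative, not CKE's actual settings:

```yaml
# kube-proxy started as: kube-proxy --config=/etc/kube-proxy/config.yml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
clusterCIDR: "10.64.0.0/14"
mode: "iptables"
hostnameOverride: "node1"
```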
Describe how to address the issue.
The "release" workflow uses ghr to manipulate GitHub Releases.
We can now use GitHub's official command-line tool gh, which is provided by the Ubuntu virtual environment.
Describe how to address the issue.
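A hedged sketch of the replacement in a workflow step; the job and step names, and the artifact paths, are illustrative rather than CKE's actual workflow:

```yaml
# Sketch: create a GitHub Release with the preinstalled gh CLI instead of ghr.
release:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v3
    - name: Create release
      run: gh release create "${GITHUB_REF_NAME}" ./cke ./ckecli --title "Release ${GITHUB_REF_NAME}"
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```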
Add a way to add nodes with specific sabakan labels to the cluster.
maybe related issue: #509
Describe how to address the issue.
Currently, kubelet and kube-proxy start after kube-controller-manager and kube-scheduler.
This could decrease the chances for Pods to be distributed among a large number of Nodes.
Reorder the operations so that Node resources get created before kube-scheduler and kube-controller-manager.
Describe how to address the issue.
Try slok/noglog to resolve this problem:
$ cke -h
Usage of cke:
-alsologtostderr
log to standard error as well as files
-config string
configuration file path (default "/etc/cke/config.yml")
-http string
<Listen IP>:<Port number> (default "0.0.0.0:10180")
-interval string
check interval (default "10m")
-log_backtrace_at value
when logging hits line file:N, emit a stack trace
-log_dir string
If non-empty, write log files in this directory
-logfile string
Log filename
-logformat string
Log format [plain,logfmt,json]
-loglevel string
Log level [critical,error,warning,info,debug]
-logtostderr
log to standard error instead of files
-session-ttl string
leader session's TTL (default "60s")
-stderrthreshold value
logs at or above this threshold go to stderr
-v value
log level for V logs
-vmodule value
comma-separated list of pattern=N settings for file-filtered logging
The current sonobuoy job takes too long (> 3 hours).
Life is short, so make it faster!
Describe the bug
After applying this release to the staging environment, CKE failed to apply the squid deployment.
This resource is applied with SSA and the failure is caused by the SSA conflict.
The corresponding CKE operation is resource-apply, and the error message is: Apply failed with 1 conflict: conflict with \"before-first-apply\" using apps/v1: .spec.template.spec.containers[name=\"squid\"].image.
Environments
To Reproduce
Steps to reproduce the behavior:
Expected behavior
squid is deployed with no error.
Kubernetes 1.20 deprecates dockershim: kubernetes/kubernetes#94624
CKE should follow the path.
Describe how to address the issue.
ckecli has a subcommand, ckecli sabakan disable, to disable sabakan integration.
This also removes the sabakan URL registered by ckecli sabakan set-url, so it completely terminates the integration.
There are use cases of suspending/resuming sabakan integration, e.g. in the maintenance of sabakan.
Support such use cases.
Add ckecli sabakan enable to restore sabakan integration.
ckecli sabakan disable will not remove the URL; instead it sets a disabled flag in etcd.
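The proposed semantics can be sketched as follows. This is a minimal illustration, not CKE code: an in-memory map stands in for etcd, and the key names and function names are hypothetical.

```go
package main

import "fmt"

// kv stands in for CKE's etcd storage; the keys are hypothetical.
type kv map[string]string

// disableSabakan suspends the integration but keeps the registered URL.
func disableSabakan(s kv) { s["sabakan/disabled"] = "true" }

// enableSabakan resumes the integration without re-running set-url.
func enableSabakan(s kv) { delete(s, "sabakan/disabled") }

// sabakanEnabled is true only when a URL is set and the flag is clear.
func sabakanEnabled(s kv) bool {
	return s["sabakan/url"] != "" && s["sabakan/disabled"] != "true"
}

func main() {
	s := kv{"sabakan/url": "http://localhost:10080"}
	fmt.Println(sabakanEnabled(s)) // true
	disableSabakan(s)              // suspend; URL is retained
	fmt.Println(sabakanEnabled(s), s["sabakan/url"])
	enableSabakan(s) // resume without ckecli sabakan set-url
	fmt.Println(sabakanEnabled(s)) // true again
}
```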
Upgrade Kubernetes version from 1.19 to 1.20.
- node-role.kubernetes.io/control-plane label
- failure-domain.beta.kubernetes.io/zone label

A maintenance operation does not run while a reboot operation is running.
- Run the reboot operation on another goroutine
- cke.cybozu.com/reboot annotation

It would be very helpful if CKE could generate TLS certificates for admission webhooks,
and can embed them in Secret resources. The CA certificate should also be embedded into ValidatingWebhookConfiguration and MutatingWebhookConfiguration.
Things to do:
- Embed the generated certificates in Secret resources.
- Embed the CA certificate into ValidatingWebhookConfiguration and MutatingWebhookConfiguration.
Add a way to add an empty node on which no Pods run. It may be useful for benchmarking purposes.
maybe related issue: #539
Describe how to address the issue.
Describe the bug
The Vault token may be revoked if a CKE instance is down for a long time.
ckecli history showed the following logs:
{
"id": "30906",
"status": "cancelled",
"operation": "etcd-wait-cluster",
"command": {
"name": "wait-etcd-sync",
"target": "https://10.69.1.136:2379,https://10.69.0.14:2379,https://10.69.0.10:2379"
},
"targets": [
"https://10.69.1.136:2379",
"https://10.69.0.14:2379",
"https://10.69.0.10:2379"
],
"info": "",
"error": "Error making API request.\n\nURL: GET https://127.0.0.1:8200/v1/cke/ca-etcd-client/roles/admin\nCode: 403. Errors:\n\n* permission denied",
"start-at": "2021-03-16T05:46:53.049466554Z",
"end-at": "2021-03-16T05:47:11.072880945Z"
}
Environments
To Reproduce
Steps to reproduce the behavior:
Expected behavior
If the vault returns 403, CKE should exit.
For example, reboot only at night.
Note: this is just a workaround for instability due to node/pod restarting.
Describe how to address the issue.
Currently, if the etcd container on a controller node is down, the kube-apiserver on the same node cannot send its status to CKE. Thus CKE does not try to recover the controller node.
Add etcdRiversOps to manage rivers containers on control planes for the etcd endpoint, with a 127.0.0.1 certificate.
Currently, cke-tools is maintained in https://github.com/cybozu/neco-containers .
As cke-tools should be maintained together with CKE, the code should be moved to this repository.
Describe how to address the issue.
Hello,
Sorry to barge in; I usually check what you are doing, as you have an extremely similar infrastructure to ours.
We ran into an issue with MetalLB + Contour (well, mostly MetalLB-related) and I'm wondering if you have faced this before and how you solved it.
There seems to be an issue with kube-proxy blackholing load-balancer external IPs (creating entries for the MetalLB IP with no endpoints) when using externalTrafficPolicy: Local. This causes traffic inside the cluster to fail to route; it works fine when coming from outside, as the anycast will land on a node with an actual pod, so there will be a route.
This is the metallb ticket: metallb/metallb#153
This is (one of) the kubernetes tickets: kubernetes/kubernetes#75262
Just curious if you found this issue in your infrastructure and how you solved it.
Thanks!
Upgrade Kubernetes version to 1.22.
When a resource or container image is changed, CKE restarts the container simultaneously on all nodes to reflect the change.
If a configuration file or container image is corrupted, the container will not work on all nodes, and the impact of failure is significant.
Implement canary release to reduce the impact of failures.
I want to display cke history continuously.
Add a --follow (-f) option to ckecli history.
If this option is specified, show the history in chronological order, and continuously print new entries as they are appended to the history.
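The follow behavior amounts to keeping a cursor into the append-only history and repeatedly printing everything past it. The sketch below illustrates that loop; in the real ckecli the history lives in etcd and would be polled or watched, whereas here a plain slice and the function name follow are hypothetical.

```go
package main

import "fmt"

// follow prints entries from cursor onward, oldest first, and returns the
// advanced cursor. Calling it repeatedly yields `--follow` behavior.
func follow(history []string, cursor int) int {
	for ; cursor < len(history); cursor++ {
		fmt.Println(history[cursor])
	}
	return cursor
}

func main() {
	history := []string{"op1: completed", "op2: completed"}
	cur := follow(history, 0) // prints the existing entries
	history = append(history, "op3: running")
	cur = follow(history, cur) // prints only the new entry
	fmt.Println(cur)           // 3
}
```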
Update Docker Compose to v2. (We use Docker Compose for Sonobuoy.)
https://github.com/cybozu-go/cke/blob/v1.22.9/sonobuoy/Makefile#L4
Describe how to address the issue.
Upgrade Kubernetes version from 1.24 to 1.25.
Sometimes the CI workflow fails randomly. The failing jobs differ from run to run.
TBD
The API version for policy resources, e.g. PodDisruptionBudget, was upgraded from policy/v1beta1 to policy/v1 at K8s 1.21.
The "pod/eviction" subresource, however, accepts "policy/v1" eviction requests only from K8s 1.22.
Update op/reboot.go to use the policy/v1 package.
Describe how to address the issue.
Rebooting worker nodes can cause a disruption of our cloud services.
We should improve the availability of the services in case of server crashes, but the improvement is only half done.
We need time-range restrictions on the processing of the reboot queue so as not to cause a service disruption during business hours.
Describe how to address the issue.
There is a diagram showing the control plane nodes. It is wrong due to the lack of rivers instances between the apiservers and etcds.
Fix this diagram.
ckecli ssh gets stuck waiting for a private key from a fifo with OpenSSH 8.6p1.
OpenSSH 8.6p1 is installed on Flatcar 2905.2.1, and mtest using Flatcar 2905.2.1 will fail because of this.
Fix ckecli ssh to work with OpenSSH 8.6p1.
Upgrade Kubernetes version from 1.20 to 1.21.
- Add DenyServiceExternalIPs to the list of additionally-enabled admission plugins; remove the default plugins from the list

Upgrade Kubernetes version from 1.25 to 1.26.
- kubescheduler.config.k8s.io/v1beta3
When the reboot queue is stuck, normally we do some manual operation and wait for the next reboot.
The waiting time could be shortened if we could reset the backoff of a reboot entry.
Implement ckecli rq reset-backoff to reset the backoff count and time.
You need to become Certified Kubernetes to use the name Cybozu Kubernetes Engine.
Please see https://github.com/cncf/k8s-conformance/ or email [email protected] with questions.
Describe the bug
When creating a k8s cluster for the first time, CKE fails at kubelet-bootstrap for a while.
It succeeds eventually, though.
ckecli history says:
{
"id": "14",
"status": "cancelled",
"operation": "kubelet-bootstrap",
"command": {
"name": "prepare-kubelet-files",
"target": ""
},
"targets": [
"10.69.0.5",
(snip)
],
"error": "Error making API request.\n\nURL: PUT https://127.0.0.1:8200/v1/cke/ca-kubernetes/issue/kubelet\nCode: 400. Errors:\n\n* unknown role: kubelet",
"start-at": "2019-09-02T07:48:29.685644433Z",
"end-at": "2019-09-02T07:48:31.700480225Z"
}
Environments
To Reproduce
Steps to reproduce the behavior:
ckecli constraints set control-plane-count 3
ckecli constraints set minimum-workers 30
ckecli constraints set maximum-workers 56
ckecli sabakan set-url http://localhost:10080
Expected behavior
kubelet-bootstrap succeeds immediately.
Additional context
The Vault log shows that many read operations for "cke/ca-kubernetes/roles/kubelet" were issued simultaneously.
It seems that highly parallel invocations of addRole() in pki.go confused Vault.
Currently CKE does not run the "observe and update" loop while rebooting a node.
This means that it does not remove an unresponsive API server from the default/kubernetes EndpointSlice while rebooting a control plane node.
#467 will solve this fundamentally, but it will be a big job.
Focus on the default/kubernetes EndpointSlice and update it before rebooting a control plane node.
Upgrade Kubernetes version from 1.18 to 1.19.
- ckecli: ckecli cluster set or ckecli sabakan set-template.
- Move the --volume-plugin-dir specification to KubeletConfiguration's VolumePluginDir field.

From the changelog: The Kubelet's --volume-plugin-dir option is now available via the Kubelet config file field VolumePluginDir. (#88480) [SIG Node]
CKE has some usage:
$ g grep volume-plugin
example/cke-cluster.yml: - "--volume-plugin-dir=/var/lib/kubelet/volumeplugins"
example/cke-cluster.yml: - "--flex-volume-plugin-dir=/var/lib/kubelet/volumeplugins"
op/k8s/kubelet_boot.go: "--volume-plugin-dir=/opt/volume/bin",
sonobuoy/cke-cluster.yml.template: - "--volume-plugin-dir=/var/lib/kubelet/volumeplugins"
sonobuoy/cke-cluster.yml.template: - "--flex-volume-plugin-dir=/var/lib/kubelet/volumeplugins"
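A hedged sketch of the migration for one of these usages; the path is taken from the grep output above, and the fragment is illustrative rather than CKE's actual configuration:

```yaml
# Before: kubelet command-line flag
#   --volume-plugin-dir=/var/lib/kubelet/volumeplugins
# After: KubeletConfiguration field
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
volumePluginDir: /var/lib/kubelet/volumeplugins
```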
When restarting node-dns, you may not be able to resolve names.
Set .spec.updateStrategy.rollingUpdate.maxSurge to 1.
Describe the bug
I could not run mtest with the procedure described in mtest README.
Environments
To Reproduce
Follow the instructions in mtest README.
At step 3, ./mssh host1 will fail.
$ make clean
$ make setup
$ make placemat
$ chmod 600 mtest_key
$ ./mssh host1
ssh: connect to host 10.0.0.11 port 22: No route to host
Expected behavior
The README should contain an up-to-date procedure.
Additional context
I could run mtest with make setup placemat bootstrap.
Describe the bug
When etcd stops, CKE hangs because the closing of concurrency.Session is not checked.
This causes a failure of the leader election.
Environments
To Reproduce
Steps to reproduce the behavior:
In Reboot Nodes Gracefully docs https://github.com/cybozu-go/cke/blob/main/docs/reboot.md#description
it says, "CKE checks the existence of Job-managed Pods in the nodes. If there are the Pods on the nodes, CKE gives up on rebooting the nodes."
Why do we need to care about Pods from Job resources? What is the consideration here?
Sorry if it's a basic question, but I'd appreciate it if you could teach me.