
kubeone's Introduction

Kubermatic KubeOne


Kubermatic KubeOne automates cluster operations on all your cloud, on-prem, edge, and IoT environments. KubeOne can install highly available (HA) multi-master clusters as well as single-master clusters.

Getting Started

All user documentation for the latest stable version is available at the KubeOne docs website.

Information about the support policy (natively-supported providers, supported Kubernetes versions, and supported operating systems) can be found in the Compatibility document.

For a quick start, check the quickstart guides on the documentation website.

Installing KubeOne

The fastest way to install KubeOne is to use the installation script:

curl -sfL get.kubeone.io | sh

The installation script downloads the release archive from GitHub, installs the KubeOne binary in your /usr/local/bin directory, and unpacks the example Terraform configs, addons, and helper scripts in your current working directory.
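You can confirm the installation succeeded by asking the binary for its version (the same output is requested when reporting bugs):

# Print the installed KubeOne version and Git commit
kubeone version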

For other installation methods, check the Getting KubeOne guide on our documentation website.

Features

Easily Deploy Your Highly Available Cluster On Any Infrastructure

KubeOne works on any infrastructure out of the box. All you need to do is provision the infrastructure and let KubeOne know about it. KubeOne takes care of setting up a production-ready, highly available cluster!

Native Support For The Most Popular Providers

KubeOne natively supports the most popular providers, including AWS, Azure, DigitalOcean, GCP, Hetzner Cloud, Nutanix, OpenStack, VMware Cloud Director, and VMware vSphere. The natively supported providers enjoy additional features such as integration with Terraform and Kubermatic machine-controller.

Kubernetes Conformance Certified

KubeOne is a Kubernetes Conformance Certified installer with support for all upstream-supported Kubernetes versions.

Declarative Cluster Definition

Define all your clusters declaratively, in the form of a YAML manifest. You describe the features you want, and KubeOne takes care of setting them up.
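For illustration, a minimal manifest could look like the sketch below; the API group and exact fields depend on your KubeOne release, so treat the values here as assumptions and consult the docs for the schema of your version.

# Write a minimal KubeOneCluster manifest (assumed v1beta2 schema, example versions)
cat <<EOF > kubeone.yaml
apiVersion: kubeone.k8c.io/v1beta2
kind: KubeOneCluster
versions:
  kubernetes: "1.27.5"
cloudProvider:
  aws: {}
EOF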

Integration With Terraform

The built-in Terraform integration allows you to easily provision your infrastructure using Terraform and let KubeOne read all the information it needs from the Terraform state.
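Assuming the example Terraform configs unpacked by the installation script, the flow looks roughly like the following sketch; the --tfjson flag name is an assumption here, so check kubeone apply --help for your version.

# Provision the infrastructure, export the Terraform output as JSON,
# and let KubeOne read hosts and other details from it
terraform init && terraform apply
terraform output -json > tf.json
kubeone apply --manifest kubeone.yaml --tfjson tf.json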

Integration With Cluster-API, Kubermatic machine-controller, and operating-system-manager

Manage your worker nodes declaratively by utilizing the Cluster-API and Kubermatic machine-controller. Create, remove, upgrade, or scale your worker nodes using kubectl. Kubermatic operating-system-manager is responsible for managing user-data for worker machines in the cluster.
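For example, scaling a worker pool is a plain kubectl operation against the MachineDeployment object; the deployment name below is a placeholder.

# List worker MachineDeployments and scale one of them; machine-controller
# then creates or deletes the corresponding cloud instances
kubectl get machinedeployments -n kube-system
kubectl scale machinedeployment my-workers -n kube-system --replicas=3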

Getting Involved

We greatly appreciate contributions! If you want to contribute or have an idea for a new feature or improvement, please check out our contributing guide.

If you want to get in touch with us and discuss improvements and new features, please create a new issue on GitHub or reach out to us on the #kubeone Slack channel.

Reporting Bugs

If you encounter issues, please create a new issue on GitHub or talk to us on the #kubeone Slack channel. When reporting a bug, please include the following information:

  • KubeOne version or Git commit that you're running (kubeone version),
  • description of the bug and logs from the relevant kubeone command (if applicable),
  • steps to reproduce the issue,
  • expected behavior

If you're reporting a security vulnerability, please follow the process for reporting security issues.

Changelog

See the list of releases to find out about feature changes.


kubeone's Issues

Cleanup old etcd Backups

After a certain period (14 days, or a value configurable in the KubeOne configuration), old etcd snapshots should be evicted from S3 to save space.

Sub-task of #2.

Handle cloud provider credentials

The machine-controller needs access to the credentials of the chosen cloud provider in order to provision nodes. The credentials can come from anywhere, but they should be deployed as a Secret into the cluster and then consumed from there by the MC.
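A hedged sketch of that approach for AWS, reusing the secret name that appears later in this document (the key names are assumptions):

# Deploy cloud provider credentials as a Secret for the machine-controller to consume
kubectl -n kube-system create secret generic machine-controller-credentials \
  --from-literal=AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" \
  --from-literal=AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY"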

Sub-task of #4.

Idempotent retry not working

I just tried to use kubeone to set up a cluster, but it timed out while waiting for etcd (not part of this issue).
When retrying kubeone install, I run into this error.

1st run

INFO[12:29:31 CET] Installing prerequisites…                                                                                                                       
INFO[12:29:31 CET] Determine operating system…                   node=3.121.214.48                                                                                 
INFO[12:29:32 CET] Installing kubeadm…                           node=3.121.214.48 os=ubuntu                                                                       
INFO[12:30:16 CET] Deploying configuration files…                node=3.121.214.48 os=ubuntu                                                                       
INFO[12:30:17 CET] Determine operating system…                   node=3.121.233.231                                                                                
INFO[12:30:18 CET] Installing kubeadm…                           node=3.121.233.231 os=ubuntu                                                                      
INFO[12:30:55 CET] Deploying configuration files…                node=3.121.233.231 os=ubuntu                                                                      
INFO[12:30:55 CET] Determine operating system…                   node=35.157.225.149                                                                               
INFO[12:30:57 CET] Installing kubeadm…                           node=35.157.225.149 os=ubuntu                                                                     
INFO[12:31:51 CET] Deploying configuration files…                node=35.157.225.149 os=ubuntu                                                                     
INFO[12:31:51 CET] Generating kubeadm config file…                                                                                                                 
INFO[12:31:53 CET] Initializing Kubernetes on leader…                                                                                                              
INFO[12:31:53 CET] Running kubeadm…                              node=3.121.214.48                                                                                 
INFO[12:32:48 CET] Generating PKI…                                                                                                                                 
INFO[12:32:48 CET] Running kubeadm…                              node=3.121.214.48                                                                                 
INFO[12:32:48 CET] Downloading PKI files…                        node=3.121.214.48                                                                                 
INFO[12:32:49 CET] Creating local backup…                        node=3.121.214.48                                                                                 
INFO[12:32:49 CET] Deploying PKI…
INFO[12:32:49 CET] Uploading files…                              node=3.121.233.231                                                                               
INFO[12:32:50 CET] Setting up certificates and restarting kubelet…  node=3.121.233.231                                                                            
INFO[12:32:53 CET] Uploading files…                              node=35.157.225.149                                                                              
INFO[12:32:54 CET] Setting up certificates and restarting kubelet…  node=35.157.225.149                                                                           
INFO[12:32:57 CET] Deploying PKI…
INFO[12:32:57 CET] Waiting for etcd to come up…                  node=3.121.233.231                                                                               
ERRO[12:34:58 CET] unable to join other masters a cluster: timed out while waiting for kube-system/etcd-ip-172-31-12-241 to come up for 2m0s 

2nd+ runs

INFO[12:35:16 CET] Installing prerequisites…
INFO[12:35:17 CET] Determine operating system…                   node=3.121.214.48
INFO[12:35:17 CET] Installing kubeadm…                           node=3.121.214.48 os=ubuntu
INFO[12:35:25 CET] Deploying configuration files…                node=3.121.214.48 os=ubuntu
INFO[12:35:26 CET] Determine operating system…                   node=3.121.233.231
INFO[12:35:26 CET] Installing kubeadm…                           node=3.121.233.231 os=ubuntu
INFO[12:35:33 CET] Deploying configuration files…                node=3.121.233.231 os=ubuntu
INFO[12:35:34 CET] Determine operating system…                   node=35.157.225.149
INFO[12:35:34 CET] Installing kubeadm…                           node=35.157.225.149 os=ubuntu
INFO[12:35:43 CET] Deploying configuration files…                node=35.157.225.149 os=ubuntu
INFO[12:35:44 CET] Generating kubeadm config file…
INFO[12:35:46 CET] Initializing Kubernetes on leader…
INFO[12:35:46 CET] Running kubeadm…                              node=3.121.214.48
ERRO[12:35:46 CET] failed to init kubernetes on leader: failed to exec command: Process exited with status 2: [preflight] Some fatal errors occurred:
        [ERROR Port-6443]: Port 6443 is in use
        [ERROR Port-10251]: Port 10251 is in use
        [ERROR Port-10252]: Port 10252 is in use
        [ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
        [ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
        [ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
        [ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
        [ERROR Port-10250]: Port 10250 is in use
        [ERROR Port-2379]: Port 2379 is in use
        [ERROR DirAvailable--var-lib-etcd]: /var/lib/etcd is not empty
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`

Check RBAC addon for missing permissions

The RBAC addon of Kubermatic supposedly contains additional roles and bindings to make the machine-controller work. The current roles let it start up, but might not allow the kubelet to work on newly created nodes.

Create a machine resource and then observe if the MC can actually fully deploy and manage the new node.

Sub-task of #4.

User Documentation

Good software is nothing without adequate documentation. We need to ensure people can actually use KubeOne. For now, before we have a website or documentation site, we should keep it short, but detailed enough to get started. Maybe Markdown in a doc/ directory if the README is not sufficient.

  • #21 - Create example configuration
  • #22 - Create a proper README

Second Proof-of-Concept Cloud Provider

We should not only test on AWS, but also test the installation on at least a second cloud provider, such as DigitalOcean. We want to make sure that our cluster setup steps are not provider-dependent.

At the end of this we should be able to confidently say "KubeOne works not just on AWS."

E2E tests for kubeone

User story: as a developer, I'd like to write and run tests that exercise Kubernetes clusters created with kubeone.

Acceptance criteria:

  • test scenarios are written just like normal unit tests, using the standard Go library (see the sketch after the task list).
  • conformance tests are executed as a basic suite of tests.
  • it builds and runs on Prow

Tasks:

  • move out to prow (#95)
  • implement a package to execute e2e tests (#96)
  • remove drone configuration (#104)
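A hedged sketch of how such a suite would be invoked; the package path and test name are hypothetical:

# Run the e2e scenarios like ordinary Go tests (path and test name are hypothetical)
go test -v -timeout 60m ./test/e2e/... -run TestKubeOneCluster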

EC2 Instances do not join the cluster

After the machine-controller has spawned instances on EC2, they come up successfully and install their dependencies, but then cannot properly register themselves as workers. They do show up in kubectl get nodes, but their journal shows

systemd[1]: Started Kubernetes transient mount for /var/lib/kubelet/pods/5f3ce42b-ecc1-11e8-90a0-021038c90d1a/volumes/kubernetes.io~secret/flannel-token-cxfwh.
kubelet[10287]: W1120 12:39:50.817264   10287 container.go:393] Failed to create summary reader for "/system.slice/run-reaef9c0c07504658b0fcf5055f42828f.scope": none of the resources are being tracked.
systemd[1]: Started Kubernetes transient mount for /var/lib/kubelet/pods/5f3c7d7a-ecc1-11e8-90a0-021038c90d1a/volumes/kubernetes.io~secret/kube-proxy-token-48x5t.
dockerd[8988]: time="2018-11-20T12:39:54Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/22b3b5ee3e1cdb85cb0337aef6295a86824e9296e2cd2598d5aec327e36c6ef4/shim.sock" debug=false pid=10562
dockerd[8988]: time="2018-11-20T12:39:54Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/4155aed9ad784ece271f588f51255514111be792ccab834d6056c29a53c6d32e/shim.sock" debug=false pid=10563
kubelet[10287]: W1120 12:39:55.552429   10287 cni.go:188] Unable to update cni config: No networks found in /etc/cni/net.d
kubelet[10287]: E1120 12:39:55.553004   10287 kubelet.go:2167] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
dockerd[8988]: time="2018-11-20T12:39:58Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/29e6e6d176ef400a639abef28fa8f54c82798f6eac9479f85fd109d8b86ee31e/shim.sock" debug=false pid=10711
kernel: ip_set: protocol 6
kubelet[10287]: W1120 12:40:00.554006   10287 cni.go:188] Unable to update cni config: No networks found in /etc/cni/net.d
kubelet[10287]: E1120 12:40:00.554551   10287 kubelet.go:2167] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
dockerd[8988]: time="2018-11-20T12:40:02Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/eea5790fc56bb94c3936c18882d2cc318c1ee208bdcb64f91363d797ab680dea/shim.sock" debug=false pid=10950
dockerd[8988]: time="2018-11-20T12:40:02Z" level=info msg="shim reaped" id=eea5790fc56bb94c3936c18882d2cc318c1ee208bdcb64f91363d797ab680dea
dockerd[8988]: time="2018-11-20T12:40:02.606592032Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
dockerd[8988]: time="2018-11-20T12:40:03.572962864Z" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
dockerd[8988]: time="2018-11-20T12:40:03Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/98058c9e8145823cacdf6380dccaebde593d47f4f46ca1c866aa993e03405092/shim.sock" debug=false pid=11058
systemd-udevd[11138]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
systemd-udevd[11138]: Could not generate persistent MAC address for flannel.1: No such file or directory
networkd-dispatcher[764]: WARNING:Unknown index 4 seen, reloading interface list
systemd-timesyncd[441]: Network configuration changed, trying to establish connection.
systemd-timesyncd[441]: Synchronized to time server 91.189.89.199:123 (ntp.ubuntu.com).
systemd-networkd[562]: flannel.1: Gained carrier
systemd-timesyncd[441]: Network configuration changed, trying to establish connection.
systemd-timesyncd[441]: Synchronized to time server 91.189.89.199:123 (ntp.ubuntu.com).
systemd-networkd[562]: flannel.1: Gained IPv6LL

Sub-task of #4.

E2E & Conformance Testing

Set up E2E & conformance testing for the minimal combination

  • #42 - provision and deprovision e2e infra with terraform in drone
  • #43 - run kubeone install in drone
  • #44 - run some basic kubectl commands on top of newly created e2e cluster from drone

Implement upgrade of the cluster

This issue tracks progress on implementing cluster upgrades with KubeOne.

Acceptance criteria:

  • Operator can upgrade a KubeOne cluster using a command such as kubeone upgrade (#211); a usage sketch follows the lists below
  • kubeone upgrade takes care of upgrading workers (MachineDeployments) (#214)
  • kubeone upgrade takes care of upgrading components we deploy on the cluster, such as machine-controller and Ark (#236)
  • We're testing that upgrades work correctly as part of our E2E tests

Planning:

  • Write a proposal for handling cluster upgrades (#190)

Implementation:

  • Create kubeone upgrade CLI command (#196, #201)
  • Implement preflight checks (#206)
  • Implement core scripts and logic for running upgrades (package upgrades and kubeadm upgrade) (#211)
  • Upgrade MachineDeployments after upgrading control plane nodes (#214)

Consider:

  • Backup critical components before running upgrade process
  • Build add-on manager for managing and updating components we deploy on cluster, such as machine-controller and Ark
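A hedged sketch of the operator-facing flow once this is implemented; the exact flags are assumptions:

# Upgrade the control plane described in the manifest (flags are assumptions)
kubeone upgrade --manifest kubeone.yaml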

move out to prow

Simply move existing configuration from drone to prow for kubeone.

Document how we deploy and use Heptio Ark

Subtask of #2

We should document how we deploy and use Heptio Ark, so end-users understand how it works, why we use it, and how they can customize it.

Acceptance criteria:

The following points are documented:

  • How we deploy Ark
  • How user can customize deployment
  • How often backups are done and how frequency can be customized
  • How often backups are deleted and how it can be customized
  • Where backups are saved
  • What do we backup out-of-box and how

Warn on missing credentials

If no environment variables for the selected cloud provider are set (e.g. AWS_ACCESS_KEY_ID if the provider is "aws"), we should log a warning and notify the user that the machine-controller will most likely not work. For this, we need to associate the env variables with their cloud provider and then check whether they're empty.
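Until KubeOne performs this check itself, a manual pre-flight check along these lines covers the AWS case (other providers' variable names would need the same mapping):

# Warn if the AWS credential variables the machine-controller relies on are empty
for v in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
  [ -n "${!v}" ] || echo "warning: $v is not set; machine-controller will likely not work" >&2
done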

Modify etcd deployment to include Ark BackupHook used to backup etcd

Subtask of #2

It is yet to be decided how we want to handle backups for the etcd cluster. When using Ark, there are two possibilities:

  • Ark-style: potentially inconsistent, but integrated into Ark. The drawbacks of this method and how to improve consistency still need to be researched.
  • Custom, consistent snapshots. This may require a sidecar container.

Support Multiple Kubernetes Versions

To be easy to use, KubeOne should be able to handle multiple Kubernetes versions in one binary (i.e. we should not have a KubeOne 1.11 that can only create k8s 1.11 and a KubeOne 1.12 that can only handle 1.12).

We do not want to spend unreasonable amounts of work supporting old Kubernetes versions. 1.12 is the goal and any older version is a nice-to-have.

  • #27 - Kubernetes 1.12
  • #28 - Kubernetes 1.11

Support for PROXY configuration

The HTTP_PROXY, HTTPS_PROXY, and NO_PROXY env variables should all be integrated into:

  • docker
  • kubelet
  • maybe something more?

This is required for air-gapped environments; a sketch of the Docker side is shown below.
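A hedged sketch for the Docker side, using the standard systemd drop-in mechanism; the proxy addresses and NO_PROXY list are placeholders, and the kubelet needs an equivalent drop-in.

# Make the Docker daemon honor the proxy settings via a systemd drop-in
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/http-proxy.conf <<'EOF'
[Service]
Environment="HTTP_PROXY=http://proxy.example.com:3128"
Environment="HTTPS_PROXY=http://proxy.example.com:3128"
Environment="NO_PROXY=localhost,127.0.0.1,169.254.169.254"
EOF
sudo systemctl daemon-reload && sudo systemctl restart docker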

Figure out why machine-controller is not spawning nodes

Deploying the MC to a cluster and then creating a MachineDeployment will properly create a MachineSet and a number of Machines, but that's about it. No instances are being spawned.

Shutting the in-cluster MC down and starting a local one from the master branch works though.

Sub-task of #4.

Decide where to store information about backups and backups location

Subtask of #2

We need to decide where we want to store information related to backups. We have two important pieces of information to take care of:

  • Minio deployment: deploying Minio should be optional, and we need a way for the user to specify whether Minio should be deployed and used
  • S3 information: for Ark, we need the URL of the S3 bucket and the S3 region

This may be extended in the future with more backup options, such as a TTL for backups. A potential way of solving this could be adding fields to the manifest.yaml or to the Cluster config.

Acceptance criteria:

  • It is decided where we want to store information about backups and backups location

Fix worker machine creation

After deploying the machine-controller v0.9.9 to a cluster and then creating a MachineDeployment like so:

apiVersion: cluster.k8s.io/v1alpha1
kind: MachineDeployment
metadata:
  name: worker-machines-deployment
  namespace: kube-system
spec:
  replicas: 5
  selector:
    matchLabels:
      workerset: worker-machines
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        workerset: worker-machines
      namespace: kube-system
    spec:
      providerConfig:
        value:
          cloudProvider: aws
          cloudProviderSpec:
            ami: ami-5970c436 # Ubuntu 18.04
            availabilityZone: eu-central-1a
            diskSize: 50
            diskType: gp2
            instanceType: t2.medium
            region: eu-central-1
            subnetId: subnet-2bff4f43
            vpcId: vpc-819f62e9
          operatingSystem: ubuntu
          operatingSystemSpec:
            distUpgradeOnBoot: false
      versions:
        kubelet: 1.12.2

  # these two are only here to prevent segfaults in the MC
  minReadySeconds: 0
  paused: false

Nothing really happens. The MC is creating a MachineSet and all 5 Machines, but then no instances are spawned on AWS EC2. Killing the MC in the cluster and starting a local one using the current master branch brings the machines up, though. (this problem is tackled in #35)

The machines boot up and create the kubelet-bootstrap services. Unfortunately, the kubelet cannot connect to the masters. The journal logs

systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 18.
systemd[1]: Stopping kubelet-healthcheck.service...
systemd[1]: Stopped kubelet-healthcheck.service.
systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
systemd[1]: Started kubelet: The Kubernetes Node Agent.
systemd[1]: Started kubelet-healthcheck.service.
health-monitor.sh[11895]: Start kubernetes health monitoring for kubelet
health-monitor.sh[11895]: Wait for 2 minutes for kubelet to be functional
kubelet[11894]: Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
kubelet[11894]: Flag --pod-manifest-path has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
kubelet[11894]: Flag --allow-privileged has been deprecated, will be removed in a future version
kubelet[11894]: Flag --authorization-mode has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
kubelet[11894]: Flag --client-ca-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
kubelet[11894]: Flag --rotate-certificates has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
kubelet[11894]: Flag --authentication-token-webhook has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
kubelet[11894]: Flag --read-only-port has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
kubelet[11894]: Flag --anonymous-auth has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
kubelet[11894]: Flag --protect-kernel-defaults has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
kubelet[11894]: Flag --cluster-dns has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
kubelet[11894]: Flag --cluster-domain has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
systemd[1]: Started Kubernetes systemd probe.
kubelet[11894]: I1119 16:06:54.986978   11894 server.go:408] Version: v1.12.2
kubelet[11894]: I1119 16:06:54.987250   11894 server.go:486] acquiring file lock on "/tmp/kubelet.lock"
kubelet[11894]: I1119 16:06:54.987397   11894 server.go:491] watching for inotify events for: /tmp/kubelet.lock
kubelet[11894]: I1119 16:06:54.987828   11894 aws.go:1042] Building AWS cloudprovider
kubelet[11894]: I1119 16:06:54.990634   11894 aws.go:1004] Zone not specified in configuration file; querying AWS metadata service
kubelet[11894]: E1119 16:06:55.146729   11894 tags.go:94] Tag "KubernetesCluster" nor "kubernetes.io/cluster/..." not found; Kubernetes may behave unexpectedly.
kubelet[11894]: F1119 16:06:55.146786   11894 server.go:262] failed to run Kubelet: could not init cloud provider "aws": AWS cloud failed to find ClusterID
systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
systemd[1]: kubelet.service: Failed with result 'exit-code'.

This is how the systemd units are configured:

ubuntu@ip-172-31-12-1:/etc/systemd/system$ cat kubelet-healthcheck.service
[Unit]
Requires=kubelet.service
After=kubelet.service
[Service]
ExecStart=/opt/bin/health-monitor.sh kubelet
[Install]
WantedBy=multi-user.target


ubuntu@ip-172-31-12-1:/etc/systemd/system$ cat kubelet.service
[Unit]
After=docker.service
Requires=docker.service
Description=kubelet: The Kubernetes Node Agent
Documentation=https://kubernetes.io/docs/home/
[Service]
Restart=always
StartLimitInterval=0
RestartSec=10
Environment="PATH=/opt/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin/"
ExecStart=/opt/bin/kubelet $KUBELET_EXTRA_ARGS \
  --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf \
  --kubeconfig=/etc/kubernetes/kubelet.conf \
  --pod-manifest-path=/etc/kubernetes/manifests \
  --allow-privileged=true \
  --network-plugin=cni \
  --cni-conf-dir=/etc/cni/net.d \
  --cni-bin-dir=/opt/cni/bin \
  --authorization-mode=Webhook \
  --client-ca-file=/etc/kubernetes/pki/ca.crt \
  --rotate-certificates=true \
  --cert-dir=/etc/kubernetes/pki \
  --authentication-token-webhook=true \
  --cloud-provider=aws \
  --cloud-config=/etc/kubernetes/cloud-config \
  --read-only-port=0 \
  --exit-on-lock-contention \
  --lock-file=/tmp/kubelet.lock \
  --anonymous-auth=false \
  --protect-kernel-defaults=true \
  --cluster-dns=10.10.10.10 \
  --cluster-domain=cluster.local
[Install]
WantedBy=multi-user.target

From what I've seen, we need to define a "cluster ID" since Kubernetes 1.10, but I could not find out how. It's supposed to be an AWS tag, kubernetes.io/cluster/<ID>=owned, which is something Terraform is supposed to apply to the masters and the MC to the worker instances.
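For reference, applying such a tag to an instance by hand looks roughly like this; the cluster name and instance ID are placeholders, and in practice Terraform would tag the control-plane instances while the machine-controller tags the workers.

# Tag an EC2 instance with the cluster ID the AWS cloud provider expects
aws ec2 create-tags \
  --resources i-0123456789abcdef0 \
  --tags Key=kubernetes.io/cluster/my-cluster,Value=owned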

This is a sub-task of #4.

Remove workers when resetting a cluster

The reset command should provide an optional flag to also destroy the worker nodes, so we can run e2e tests without having those machines still hanging around. As the workers may already have been used and contain volumes with data on them, we should not remove them by default, or else the reset command becomes way too dangerous.

Cluster Backups

This issue tracks progress on adding Cluster Backups as a feature. As discussed on Slack and in the team meetings, it was decided to use Heptio Ark to handle backups for both etcd and the cluster as a whole.

Acceptance criteria:

  • We're doing frequent etcd backups
  • We're doing frequent cluster backups
  • User can choose between storing backups on the cluster or in the cloud

Decisions to be made:

  • #52: Decide how we want to handle etcd backups
  • #53: Decide how to handle Ark manifests
  • #54: Decide where to store information about backups and backups location

Subtasks:

  • #49: Deploy Heptio Ark on KubeOne clusters (#61)
  • #132: Deploy Restic along with Ark to support snapshots of additional volume types (#136)
  • #50: Deploy Minio on KubeOne clusters
  • #52: Modify etcd deployment to include Ark BackupHook used to backup etcd (#127)
  • #24: Create Ark Schedule (CronJob) to automatically backup etcd (#130)
  • #25: Cleanup old etcd backups (#130)
  • #133: Support providing credentials for other Ark supported platforms
  • #51: Document how we deploy and use Heptio Ark

Deploy Minio on KubeOne clusters

Subtask of #2
Related to #49

In order to store backups created by Heptio Ark, we may want to deploy Minio as S3-compatible storage, so backups can be stored locally on the cluster.

Acceptance criteria:

  • Minio is deployed on the clusters
  • Minio is deployed as an optional component
  • User is allowed to choose between Minio and external S3 storage

Create Ark backup operation CronJob

Subtask of #2

We need a CronJob that runs Ark periodically to back up the cluster and store the backup on the selected storage.

It is not possible to use Ark directly to back up the etcd cluster, as etcd doesn't allow direct access to its container. Instead, we need a sidecar that runs etcdctl to snapshot the etcd cluster and save the snapshot on a volume, and then use Ark to back up that volume.
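The sidecar's job would boil down to a command along these lines (certificate paths follow kubeadm defaults and are assumptions):

# Take a consistent etcd snapshot onto a volume that Ark can then back up
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save "/backup/etcd-$(date +%Y%m%d%H%M%S).db"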

Acceptance criteria:

  • A CronJob which periodically backs up the cluster is created when Ark is deployed

Outstanding questions:

  • It is yet to be decided how frequently we want to back up clusters.
  • It is yet to be decided whether the backup interval should be configurable by the end-user.

Resolving the issue:

  • Ark has the schedule command, which we can use to automate the CronJob creation. The following command sets up a CronJob to run backups daily:
ark schedule create <SCHEDULE NAME> --schedule "0 7 * * *"

Check spelling/grammar of error/status messages

Exhibit A:
ERRO[12:47:20 CET] unable to join other masters a cluster: timed out while waiting for kube-system/etcd-ip-172-31-12-241 to come up for 2m0s

I guess it would not spot this error, but maybe we should also add an automated spellchecker to CI/CD, like in kubermatic?

WDYT @xrstf @kron4eg @xmudrii ?

Create example configuration

We should provide a config.yaml.dist for tech-savvy users to quickly get started without reading through lengthy documentation. It should list all the available options with a short explanation.

This issue is not about reducing the number of options, but only about documenting them. Changing what is actually configurable should be discussed in a different issue, if needed. Remember, the goal is not to make each and every possible thing configurable, but to keep it sensible and rely on sane defaults. We should not add options before the need actually arises.

Sub-task of #3.

Support for Kubernetes 1.13

With Kubernetes 1.13 coming out today, we should plan 1.13 support for KubeOne. Compared to 1.11 and 1.12, there may be breaking changes because kubeadm is leaving beta and graduating to GA in 1.13, so we can't use the same bootstrap logic.

Deploy Heptio Ark on KubeOne clusters

Subtask of #2

We should deploy Heptio Ark on clusters in order to handle periodic backups for etcd and the whole cluster.

Acceptance criteria:

  • Ark is deployed automatically on all KubeOne clusters
  • Ark is capable of making backups on Minio-based storage
  • Ark is capable of making backups on cloud S3-based storage

The installation steps are described in the official documentation. In short, the installation looks like this:

  • Download tarball from GitHub Releases and unpack it
  • Deploy Ark server
  • Verify that Ark is running as intended

The Ark version we should use is v0.10.0.

Investigate DNS issues on Kubernetes 1.12 clusters

This is a postmortem and action issue for DNS issues we encountered on Kubernetes 1.12 clusters. Over the past two weeks we were debugging issues related to CoreDNS, kube-dns, and Flannel, so I decided to summarize how the debugging process went, what we tried, and the fix we implemented.

Postmortem

Week 1 (Nov 19th - Nov 23rd)

Switch to kube-dns

  • We added support for Kubernetes 1.12 HA clusters in #12, provisioned following the official kubeadm documentation. At the time of implementation, we hadn't noticed any issues.
  • Shortly after the PR got merged, we noticed CoreDNS pods were stuck in CrashLoopBackOff with the following error message:
2018/11/13 10:34:15 [INFO] CoreDNS-1.2.2
2018/11/13 10:34:15 [INFO] linux/amd64, go1.11, eb51e8b
CoreDNS-1.2.2
linux/amd64, go1.11, eb51e8b
2018/11/13 10:34:15 [INFO] plugin/reload: Running configuration MD5 = f65c4821c8a9b7b5eb30fa4fbc167769
2018/11/13 10:34:15 [FATAL] plugin/loop: Seen "HINFO IN 9092728187714189058.3535744278062856330." more than twice, loop detected
  • We started investigating the issue as soon as possible. We tried many fixes, including:
    • Setting --resolv-conf flag on all Kubelets
    • Verifying permissions, RBAC roles, and AWS settings
    • Following Troubleshooting guidelines for kubeadm
    • Updating CoreDNS to the latest version
    • Disabling the loop CoreDNS plugin
    • Asking for help on Kubernetes Slack (SIG-Cluster-Lifecycle)
    • Creating a plain cluster on DigitalOcean with no firewalls. Same issue encountered.
  • There were some reports of the issue, but the issue is very rare and very hard to reproduce.
  • After none of the fixes succeeded, we followed an idea from SIG-Cluster-Lifecycle and decided to disable CoreDNS and switch back to kube-dns by disabling the appropriate feature gate. kube-dns worked and there were no CrashLoopBackOffs. #14 has details about the initial investigation.

Addition of machine-controller

  • Deployment of the MachineController was added to KubeOne. We found out that the MachineController was not creating any AWS worker nodes and started investigating. We found out that the MachineController fails only in-cluster, while it works when run locally. We checked RBAC, AWS settings, and everything potentially related.
  • As a fallback, we wanted to make sure credentials were being handled correctly and decided to deploy a cluster on DigitalOcean along with DigitalOcean worker nodes. After deploying worker nodes, the MachineController would panic when trying to list existing Droplets.
  • We found out that DNS was not working in any pod and that the MachineController couldn't resolve the API. We immediately started investigating again and retried some of the fixes we had tried earlier. Trying to nslookup or ping any address would fail, for example:
nslookup: can't resolve '(null)': Name does not resolve

nslookup: can't resolve 'google.com': Try again
  • Similarly, we found out that resolving Kubernetes Services would fail as well.
  • We checked whether the DNS issues were present on AWS clusters and found out that they were, same as on DigitalOcean clusters.

Week 2 (Nov 26th - Nov 30th)

  • We continued investigating the issue further and tried several more fixes.
  • While none of the fixes helped, we started debugging CNI to make sure it was working correctly. Flannel pods were running and the correct Pod CIDR was set. No error logs for Flannel, nor for kube-dns.
  • As there was no success, we decided to re-enable CoreDNS and try to find out whether it was the root of the problem.

Temporary issue resolution

On 11/28, we resorted to asking again on the SIG-Cluster-Lifecycle channel on Kubernetes Slack. While it is unknown and hard to debug why this happens, we got a recommendation to update the CoreDNS config to switch from proxy . /etc/resolv.conf to proxy . 172.31.0.2.

After updating the config and restarting the CoreDNS pods, DNS resolution started working again. The CoreDNS pods are no longer in CrashLoopBackOff and their logs no longer report any errors.
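For the record, applying that change comes down to editing the CoreDNS ConfigMap and recycling its pods; the forwarder address is specific to this VPC, not a general value.

# Replace "proxy . /etc/resolv.conf" with "proxy . 172.31.0.2" in the Corefile
kubectl -n kube-system edit configmap coredns
# Recreate the CoreDNS pods so they pick up the new Corefile
kubectl -n kube-system delete pod -l k8s-app=kube-dns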

Looking up internal Kubernetes services and external endpoints works:

$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
$ kubectl exec -ti busybox -- nslookup google.com
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      google.com
Address 1: 2a00:1450:4001:815::200e
Address 2: 216.58.207.46 fra16s24-in-f14.1e100.net

Permanent solution

We found out that the kubelet environment was not initialized correctly on the second and the third master.

This was caused by omitting the following kubeadm command:

kubeadm alpha phase kubelet write-env-file --config kubeadm-config.yaml

This command is used by the guide in the official documentation; however, it has a bug with kubeadm 1.12. The example in the guide uses the ClusterConfiguration object, and we kept that object as well, so the command was not working.

This is the error that command returns:

didn't recognize types with GroupVersionKind: [kubeadm.k8s.io/v1alpha3, Kind=ClusterConfiguration]

Instead of following up on the issue, I decided to omit the command for now, until we find a fix. Instead, I tried to rely on the systemd unit file, which seemed to work, but actually did not. Today I followed up with SIG-Cluster-Lifecycle and found out what this command does, so I implemented its behavior manually in #57.

Notes

This is a duct-tape solution and may cause problems in the future if we don't debug it and find out why this is happening.

If anything about DNS changes, the in-cluster DNS could start failing again.

The issue may be fixed in Kubernetes 1.13, but that remains to be verified once it is available.

Action items

  • Create a PR that updates the CoreDNS configuration so it starts working again (@xmudrii)
  • Verify whether the problem is present on Kubernetes 1.11 clusters and apply the fix if it is (@xmudrii)
  • Start investigating why this happens and how to fix the underlying problem
  • Report to MachineController maintainers that AWS API errors are not correctly propagated

This issue will be updated as we continue investigating.

/cc @xrstf @kron4eg @thetechnick @alvaroaleman

Automate creation of worker nodes

In contrast to master nodes, worker nodes must be created using the Cluster API, i.e. by deploying our machine-controller to the cluster and then creating machine resources. Who creates these machine resources is not part of this epic.

  • #19 - Create machine-controller deployment
  • #23 - Handle cloud provider credentials
  • #20 - Check RBAC addon for missing permissions
  • #35 - Figure out why machine-controller 0.9.9 is not spawning nodes
  • #34 - Fix worker machine creation
  • #39 - EC2 Instances do not join the cluster

Decide how to handle Ark manifests

Subtask of #2

We should decide how we want to handle and store Ark manifests. There are three possibilities:

  • Download the Ark release tarball, which contains all required manifests, and deploy those manifests. The upside is that we don't keep long manifest templates in our code base, but we are quite limited and can't easily change them.
  • Store all manifests used for Ark in pkg/templates, like we do for other components. It is harder to maintain and later upgrade templates to newer versions, but it's easier to change manifests to our needs.
  • Hybrid approach: deploy prerequisites (CRDs and optionally Minio) from the Ark manifests and other components from our templates.

In the case of the second approach, an additional question is how we want to keep the manifests: in the form of YAML templates or in the form of Go code?

Acceptance criteria:

  • It is decided what manifests we want to use for deploying
  • It is decided how we store those manifests in our code if needed

Invalid machine deployment env

In some cases, kubeone creates the machine-controller deployment with broken env entries (note the empty name fields below).

   spec:
      containers:
      - args:
        - -logtostderr
        - -v
        - "4"
        - -internal-listen-address
        - 0.0.0.0:8085
        - -cluster-dns
        - 10.96.0.10
        command:
        - /usr/local/bin/machine-controller
        env:
        - name: ""
        - name: ""
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              key: AWS_SECRET_ACCESS_KEY
              name: machine-controller-credentials

Create a proper README

The README should contain:

  • a short mission statement of what this project does (one/two sentences)
  • Features
  • Requirements (for runtime, not for building)
  • Building Instructions
  • Usage Instructions
  • License Note

Sub-task of #3.
