cncf / demo Goto Github PK

View Code? Open in Web Editor NEW

77.0 35.0 39.0 1.67 MB

Demo of CNCF technologies

Home Page: https://cncf.io

License: Apache License 2.0

Shell 0.64% Python 1.12% Lua 0.02% JavaScript 94.30% HTML 0.50% CSS 0.29% HCL 3.12%

cloud-native kubernetes cncf

demo's Introduction

CNCF Technologies Demonstration

The goal of this project is to demonstrate each of the technologies that have been adopted by the Cloud Native Computing Foundation (CNCF) in a publicly available repository in order to facilitate their understanding through simple deployment tooling and by providing sample applications as common-ground for conversation. This project will enable replicable deployments and facilitate quantification of performance, latency, throughput, and cost between various deployment models.

Beta Preview

Demo run results are displayed at: beta.cncfdemo.io.

Project Overview
- Sample Applications
- Deployment Models
Getting Started
Architecture
- Kubernetes Architecture
Postmortem & Suggestions

Technologies

Kubernetes - Project, Source
Prometheus - Project, Source

Summary of Sample Applications

Count.ly - Project, Source

Goals:
1. demonstrate autoscaling of Countly
2. illustrate background, lower priority jobs vs foreground, higher priority jobs
Details of sample application

Boinc - Project, Source

Goals:
1. demonstrate grid computing use case
2. contribute cycles to curing Zika in IBM World Community Grid
Details of sample application

Supported Deployments

A variety of deployments models will be supported. Support for each deployment model will be delivered in this order of priority:

Local (on your machine)
CNCF Community Cluster
AWS
Azure
GCP
Packet

Given this breadth of supported deployment models using the same sample applications, performance, cost, etc. characteristics between this variety of clusters may be compared.

CNCF Community Cluster

The CNCF Community Infrastructure Lab (CIL) provides free access to state-of-the-art computing resources for open source developers working to advance cloud native computing. We currently offer access to both x86 and ARMv8 bare metal servers for software builds, continuous integration, scale testing, and demonstrations.

The on-demand infrastructure resource is generously contributed and managed by New York City-based Packet, a leading bare metal cloud, as part of its commitment to the cloud native and open source communities. The resources are available from 15 locations across the globe; including New York City, Silicon Valley, Amsterdam, and Tokyo. It allows developers extended testing or the ability to build out continuous integrated infrastructure with the automation and consistency of big public clouds without being required to use virtualization.

An Open Commitment

The project output will be an open source GitHub repo that will become widely referenced within the CNCF community. All work will occur on a public repo, all externally referenced projects will be open source, and this project itself will be licensed under Apache 2.0.

Disclaimer

Note that these are explicitly marketing demos, not reference stacks. The CNCF’s Technical Oversight Committee will over time be adopting additional projects and may eventually publish reference stacks. By contrast, this project is designed to take the shortest possible path to successful multi-cloud deployments of diverse applications.

Quick Start Guide _{^{(back to TOC)}}

Getting started with the cncfdemo is a three-step process:

Install cncfdemo
Create a Kubernetes cluster, running Prometheus
Run demo apps & benchmarks

1. Install cncfdemo

Run pip install cncfdemo

pip is the python package manager. It is strongly recommended to also use a dedicated python virtualenv. For detailed install instructions for your platform read: The Hitchhiker's Guide to Python.

2. Create Cluster

cncfdemo bootstrap aws

AWS is used as an example. Substitute with your provider of choice. _{Note: Grab a beverage, this step takes several minutes.}

3. Run Demo

Run cncfdemo start
Browse to countly.cncfdemo.io
Run cncfdemo benchmark --time 5m --save-html

The `cncfdemo` command shadows and complements the official Kubectl binary.

❯ cncfdemo create configmap example --from-file=path/to/directory

❯ kubectl create configmap example --from-file=path/to/directory

cncfdemo is written in Python and like Kubectl interacts with the remote REST API server. Unlike Kubectl, it supports HTTP only. Further differing from kubectl it is able to create new clusters on your favorite cloud provider (or even bare metal).

Complex, Scriptable Kubernetes Deployments & Jinja Templating

In addition to the ability to quickly spin up new clusters from scratch the cncfdemo command comes with a built in demo of a complex multistep multicomponent deployment.

When you run:

❯ cncfdemo start

The following is going on behind the scenes:

Prometheus and its Pushgateway are deployed
Grafana is deployed with preconfigured dashboards to expose metrics collected by Prometheus
ConfigMaps are created from autodetection of configuration files required by the applications being deployed
A sharded mongo cluster is provisioned
One Shot Kubernetes Jobs initialize and configure the mongo cluster
A mongos service is exposed internally to the cluser
Multiple instances of Countly are spun up against the mongo cluster
Countly service is exposed at a human readable subdomain countly.cncfdemo.io via Route53
HTTP Benchmarking is performed against the Countly subdomain via WRK jobs
Idle cluster capacity to search for a cure to the Zika virus is donated via Boinc and IBM WorldCommunityGrid

The demo described above is difficult and brittle to put together with regular kubectl usage. Editing YAML files by hand is time consuming and error prone.

Behind the scenes

The demo was accomplished with Jinja templating, several advanced kubernetes primitives & patterns that are currently in Alpha, and extending and adding some functionality to the cncfdemo wrapper - all in order to greatly simplify and reduce the number of commands required to accomplish a complex deployment.

Future Plans

Additional cloud providers support
A visualization/UI layer to display the progress of cluster bootstraps, deployments, and benchmarks

Architecture _{^{(back to TOC)}}

Image based Kubernetes deployments

The Kubernetes project consists of half a dozen standalone binaries, copied to their appropriate location along with associated Systemd unit files*.

	Master	Minion
kube-apiserver	✔
kube-controller-manager	✔
kube-scheduler	✔
kube-proxy		✔
kubelet		✔
kubectl (No service file)

The first three belong on master nodes, kube-proxy and kubelet belong on minions, and kubectl is just an optional handy utility to have on the path.

Instead of cutting seperate images for masters and minions we rely on Cloud-init -- the defacto multi-distribution package that handles early initialization of a cloud instance -- and Systemd drop-in files to tag an instance as a master or minion.

Systemd drop-in files

The Kubernetes unit files are written upstream and should work on any distro that supports systemd. There's no need to edit them directly, they are as static as their associated binaries.

Instead, we want to override only specific directives from these unit files. Systemd has a mechanism that picks up drop-in files and appends or modifies a unit file's directives.

So for example, an upstream provided unit file for kube-apiserver exists at /lib/systemd/system/kube-apiserver.service, we simply add a file at /lib/systemd/system/kube-apiserver.service.d/role.conf with the contents:

ConditionPathExists=/etc/sysconfig/kubernetes-master

At boot Systemd will essentially merge role.conf into the original unit file, and start the kube-apiserver service based on whether or not a file exists at /etc/sysconfig/kubernetes-masters (This is called path based activation).

With this baked into a server image (by a tool like Packer) all that is left is to specify how many copies we want to run and tell cloud-init to create the file. This functionality is common to basically any modern distro and cloud provider, and library (like boto) or provisioning tool (like Terraform).

For example on AWS:

aws ec2 run-instances --image-id ami-424242 --count 3 --user-data 'touch /etc/sysconfig/kubernetes-master'

Other useful settings cloud-init can override

The clustername - Kubernetes clusters require a unique id
The url to pull the addons manager from (so it doesn't have to be baked into the image)
The master endpoint
- not recommended but useful for testing; The preferable approach is to provision the cloud environment to route by convention masters.{clustername}.domainame
Other endpoints for customized images that include things like fluentd forwarding logs to S3, The cncfdemo backend API, 'etc.

Kubernetes Architecture

This document will walk you through setting up Kubernetes. This guide is for people looking for a fully automated command to bring up a Kubernetes cluster (In fact, this is the basis for the cncfdemo command utility and you can use that directly or learn how to make your own).

Kubernetes is currently experiencing a cambrian explosion¹ of deployment methodologies. Described below is an opinionated approach with three guiding principles:

Minimal dependencies
Portability between cloud providers, baremetal, and local, with minimal alteration
Image based deployments bake all components into one single image

If you just want to try it out skip to the Quick start.

Three Groups

Kubernetes components are neatly split up into three distinct groups*.

_{Diagram of a highly available kubernetes cluster}

etcd is a reliable distributed key-value store, its where the cluster state is kept. The Source of Truth. All other parts of Kubernetes are stateless. You could (and should) deploy and manage an etcd cluster completely independently, just as long as Kubernetes masters can connect and use it.

Lets zoom in further on one of those circles representing a Kubernetes minion.

_{_{*AWS AutoScalingGroups, GCE "Managed Instance Groups", Azure "Scale Sets"}}

Cluster bootstrap via DNS discovery

Later on you will see DNS discovery of services being used extensively within Kubernetes clusters as a core feature. But lets continue with the bootstrap, as per the table a minion has only a kubelet and a kube-proxy.

_{Contents of kubelet.service:}

[Service]
EnvironmentFile=/etc/kubernetes/config
EnvironmentFile=/etc/kubernetes/kubelet
EnvironmentFile=/etc/sysconfig/kubernetes-minions
ExecStartPre=/bin/bash -c "until /usr/bin/curl -s http://masters.${CLUSTER_NAME}.k8s:8080; do echo \"waiting for master...\"; sleep 5; done"
ExecStart=/bin/kubelet \
            --api-servers=http://masters.${CLUSTER_NAME}.k8s:8080 \
            $CLOUD_PROVIDER \
            $KUBELET_ARGS

The precondition ensures the master is reachable and responsive. The endpoint is by convention 'masters.cncfdemo.k8s'.

There are many other ways to do this, however, this approach is not provider specific.

Making the bookkeeping orthogonal to the deployment process

For AWS the principles are outlined in Building a Dynamic DNS for Route 53 using CloudWatch Events and Lambda.

The process is reduced to the following;

Configure CloudWatch to trigger a Lambda function on any and all AutoScalingGroup events
Lambda function simply sets a DNS record set of the private 'k8s' domain to reflect the list of healthy instances in that group

As a result, an AutoScalingGroup with Tags: KubernetesCluster, Role, will always have membership correctly reflected via '{Role}.{KubernetesCluster}.k8s' DNS lookups.

Software-defined networking for minions

At this point you might be wondering why minions.{KubernetesCluster}.k8s is needed. The masters subdomain is useful because the minions need to point at the masters. But who needs to point at minions? The answer is the other minions.

Kubernetes has a distinctive networking model.

Kubernetes allocates an IP address to each pod. When creating a cluster, you need to allocate a block of IPs for Kubernetes to use as Pod IPs. _{-- Kubernetes Docs}

In order to let Pod A (10.32.0.1) from one minion node (172.20.0.1) communicate with Pod B (10.32.0.3) on another minion node (172.20.0.2) we use an Overlay network. It is possible to achieve this sort of routing without an overlay network (and associated performance penalty) but an overlay is simpler to configure and more importantly it is portable.

CNI, the Container Network Interface, is a proposed standard for configuring network interfaces for Linux application containers. CNI is supported by Kubernetes, Apache Mesos and others.

Enabling CNI

Required directories for CNI plugin:

/opt/cni/bin
/etc/cni/net.d

The default cni plugin binaries need to be placed in /opt/cni/bin/. We have opted to use Weave, its setup script adds weave binaries into this directory as well.

Finally, we direct the Kubelet to use the above:

KUBELET_ARGS="--network-plugin=cni --network-plugin-dir=/etc/cni/net.d --docker-endpoint=unix:///var/run/weave/weave.sock"

Weave Quorum

Kubernetes will now rely on the Weave service to allocate the IPs in the overlay network.

PEERS=$(getent hosts minions.cncfdemo.k8s | awk '{ printf "%s ", $1 }')

MEMBERS=$(getent hosts minions.cncfdemo.k8s | wc -l)

/usr/local/bin/weave launch-router --ipalloc-init consensus=$MEMBERS ${PEERS}

You can read further details on Weave initialization strategies. We are using the consensus strategy. In keeping with our example:

PEERS=172.20.0.63 172.20.0.64
MEMBERS=2

Weave Net uses the estimate of the number of peers at initialization to compute a majority or quorum number – specifically floor(n/2) + 1.

If the actual number of peers is less than half the number stated, then they keep waiting for someone else to join in order to reach a quorum.

Once the quorum has been reached you can see how the IP allocation has been divvied up between the members.

weave status ipam

82:85:7e:7f:71:f3(ip-172-20-0-63) 32768 IPs (50.0% of total)

ce:38:5e:9d:35:ab(ip-172-20-0-64) 32768 IPs (50.0% of total)

_{For a deeper dive on how this mechanism works: Distributed systems with (almost) no consensus.}

Details of Sample Applications

Countly

Countly is an open source web & mobile analytics and marketing platform. It provides insights about user actions.

Configuration Files

Two configuration files used to dictate the behavior of this demo application are api.js and frontend.js. Each configuration file contains only one change from the default configurationonly line changed from the default config:

Host: "mongos:default"

By setting Host to "mongos.default", the Countly application looks for its MongoDB servers at the address "mongos.default". The "mongos.default" reference resolves to a Kubernetes service called mongos. The .default namespace is the default top-level domain for pods and services deployed in Kubernetes.

Deep Dive

Patterns and Best Practices
- How to adapt your app to run in Kubernetes (Countly example in detail)
- Clustered Datastores on top of Kubernetes (Mongo example in detail)
- Making use of spare capacity with Background Jobs (Boinc example in detail)
Complex, Scriptable Kubernetes Deployments & Jinja Templating

What actually happens when you 'cncfdemo start'

Notes on Containerizing Apps

Picking a base image

Inevitably when working with containers the question of the base image comes up. Is it better to opt for the spartan minimalism of Alpines and Busyboxes or a bog standard 700MB CentOS?

Should you split your app into multiple containers each running a single process or bake everything into one blob?

The (typical engineer) answer is "it depends".

Take a complex app like Countly for example. To package it up conveniently, so a developer can quickly try it out on her laptop for instance, it is necessary to bundle Mongo, Nginx, NodeJS, the Countly API server app, and the dashboard UI app.

Single process per container or... not

You can't always run one process per container. What you really might crave in such a situation is a process control system or even a proper init.

Traditionally a Docker container runs a single process when it is launched, for example an Apache daemon or a SSH server daemon. Often though you want to run more than one process in a container. There are a number of ways you can achieve this ranging from using a simple Bash script as the value of your container’s CMD instruction to installing a process management tool. _{- Docker's documentation on using supervisord}

There's several such supervisors, a popular one being runit. Runit is written in C and uses less resources than supervisord, adheres to the unix philosophy of utilities doing one thing well, and is very reliable.

Resolving the PID 1 problem

There's a subtle problem of Docker and PID 1 zombie reaping the aforementioned process supervisors alone don't solve.

The Ubuntu based phusion baseimage works around this with a small (340 line) my_init script.

Ideally, the PID 1 problem is solved natively by Docker. It would be great if Docker supplies some builtin init system that properly reaps adopted child processes. But as of January 2015, we are not aware of any effort by the Docker team to address this.

As of September 20, 2016, this is finally fixed by Docker upstream with an optional small new daemon that gets injected with --init=true.

Customizing Countly for Kubernetes

Countly provides an official docker image based on Phusion the advantages and considerations of which are outlined above.

We extend it and simply keep the services we want to use:

FROM countly/countly-server:16.06

# Add custom Countly configs - these in turn come from k8s volume
ADD ./runit/countly-api.sh /etc/service/countly-api/run
ADD ./runit/countly-dashboard.sh /etc/service/countly-dashboard/run

Example service

#!/usr/bin/env bash

cp /etc/config/api.js /opt/countly/api/config.js
chown countly:countly /opt/countly/api/config.js

exec /sbin/setuser countly /usr/bin/nodejs /opt/countly/api/api.js

_{countly-api.sh is almost exactly like the file we replaced.}

This service file is executed by runit and clobbers the default config each time. The file /etc/config/api.js is not actually permanently baked into the image but rather arrives via a Kubernetes configmap.

And here we've had to resort to a bit of a hack. ConfigMap backed volumes mounted as root is a known and open issue. At the moment there's no way to specify permissions. Hence, the chown line.

Decomposing apps into microservices for Kubernetes

We've completely gotten rid of the Nginx service countly bundles as edge routing can be done any number of ways elsewhere with Kubernetes.

Whether or not we split apart the dashboard app and the API server is not a question of convenience or style. The API server clearly maps to a replication controller and can be horizontally auto scaled with custom metrics (more on this later).

The dashboard app for our purposes has no high availability requirement and is rarely used. However, even when idle, it is taking up resources on the pod and this waste is multiplied across however many API servers we end up -- whereas we only need one dashboard app running at a time.

The clean way is to split it out further to one separate pod on the side.

As for mongo, both of these service contain a connection string we pass as a configmap like so:

mongodb: {
        host: "mongos.default",
        db: "countly",
        port: 27017,
        max_pool_size: 500,
    }

Separation of concerns

As a result, it is up to us to deploy and scale mongo separately from countly. Even if this particular mongo cluster is dedicated entirely to countly, and it should be, this separation of concerns is good for maintainability and resilience.

This decoupling is healthy. For example, a bug in one of the horizontally scaled countly API servers that causes a crash would not take a mongo pod along with it and thus the impact on overall performance is contained. Instead it will crash and burn on the side, the liveliness tests will fail, and Kubernetes in turn will transparently route away further requests to siblings while simultaneously launching a replacement.

graph showing how chaos-monkey style killing one of the countlies impacts overall writes, and for how long (fast recovery is cool)

InitContainers

Ask yourself, does it make sense for Countly pods to be running when there is no Mongo backend available to them? The answer is no. In fact, if Countly happens to start first the deployment becomes unpredictable.

Luckily init containers have reached beta status in the latest version of Kubernetes (1.4). In short, with this feature enabled the pod you normally specify alone now starts last in a blocking list.

pod:
  spec:
    containers: ...
    initContainers:
    - name: init-container1
      image: ...
      ...
    - name: init-container2
    Containers:
    - name: regularcontainer

init-container1 can have a simple command along the lines of nslookup monogos.default | -ge 3. This will fail as long as the mongo service is not up and running and passing its readyness and liveliness probes. Countly will be blocked from starting until the init container succeeds, exactly the desired behaviour.

Postmortem & Suggestions

Abstract

Kubernetes can run on a wide range of Cloud providers and bare-metal environments, and with many base operating systems.

Major releases roll out approximately every 15 weeks with requirements and dependencies still somewhat in flux. Additionally, although there are unit and integration tests that must be passed before a release, in a distributed system it is not uncommon that minor changes may pass all tests, but cause unforeseen changes at the system level.

There's a growing plurality of outdated and conflicting information on how to create a Kubernetes cluster from scratch. Furthermore, and perhaps most painfully, while a release can be assumed to be reasonably stable in the environment it was tested in ("works for me"), not many guarantees can yet be made about how things will work for a custom deployment.

What follows chronologically describes one beaten path to a custom cluster and some of the dos and don'ts accumulated from the false starts.

Picking a host operating system

_{_{"If you wish to make an apple pie from scratch, you must first invent the universe." -- Carl Sagan}}

Starting out on AWS one might be tempted to quick start from the web console and opt for Amazon Linux AMI. But that is not a portable choice. So at least for the sake of portability (perhaps in the future you'll want to run on another cloud provider, bare metal, or your laptop) it is better to opt for something like CentOS, Debian, or CoreOS.

This is not an easy choice and there is no universal answer. Each option brings along its own dependencies, problems, and bugs. But since choose we must we will go down the CentOS direction of this decision tree and see how far it takes us.

CentOS 7

Official CentOS images are provided for us on the AWS Marketplace.

To avoid ending up with deprecated AMI's and outdated images it is recommended to grab the AMI id programmatically (aws --region us-west-2 ec2 describe-images --owners aws-marketplace --filters Name=product-code,Values=aw0evgkw8e5c1q413zgy5pjce) and in case of a response with multiple ids pick the one with the most recent creation date as well as running yum update as the first step in your build process.

Default Docker is strongly discouraged for production use

Docker is not actually a hard requirement for Kubernetes, but this isn't about recommending alternative container runtimes. This is about the defaults being a hidden minefield.

The warning

What happens with the common yum install docker?

$ docker info

Containers: 0
Server Version: 1.10.3
Storage Driver: devicemapper
 Pool Name: docker-202:1-9467182-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 ..
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata

As you can see from the warning, the default Docker storage config that ships with CentOS 7 is not recommended for production use. Using devicemapper with loopback can lead to unpredictable behaviour.

In fact, to give a bit of a look into this dead end if we follow the path all the way to a Kubernetes cluster you will see nodes coming up like this:

systemctl --failed
  UNIT                         LOAD   ACTIVE SUB    DESCRIPTION
● docker-storage-setup.service loaded failed failed Docker Storage Setup
● kdump.service                loaded failed failed Crash recovery kernel arming
● network.service              loaded failed failed LSB: Bring up/down networking

What docker-storage-setup is trying (and failing) to do

docker-storage-setup looks for free space in the volume group of the root volume and attempts to setup a thin pool. If there is no free space it fails to set up a LVM thin pool and will fall back to using loopback devices. Which we are warned by docker itself is a strongly discouraged outcome.

Why this is a problem

This is insidious for several reasons. Depending on how many volumes your instance happens to spin up with (and how they're configured) you might never see this warning or experience any problem at all. For example if you have one hard-drive on bare-metal and no unallocated space this will always happen.

If the disk provisioning changes you might end up in this edge case but the cluster will still initially appear to be working. Only after some activity will xfs corruption in the docker image tree (/var/lib/docker) start to sporadically manifest itself and kubernetes nodes will mysteriously fail as a result.

Despite this being known as problematic for some time and documented, people still frequently run into this problem.

Incidentally yum install docker can result in slightly different versions of docker installed.

Each docker release has some known issues running in Kubernetes as a runtime.

So what's the recommended docker version? v1.12 or v1.11? It turns out the latest (v1.12) is not yet supported by Kubernetes v1.4.

The problem is a distribution like CentOS 7, officially supported by Kubernetes, by default will arbitrarily work for some but not others, with the full requirements hidden and underspecified.

At the very least Docker versions should be pinned together with OS and Kubernetes versions and a recommendation about the storage driver should be made.

To avoid these pitfalls, carefully select the storage driver

As Docker has a pluggable storage driver architecture and the default is (or might be) inappropriate, you must carefully consider your options. As discussed, getting this wrong will eventually cascade all the way to hard to debug and reproduce bugs and broken clusters.

Which storage driver should you choose? Several factors influence the selection of a storage driver. However, these two facts must be kept in mind:

No single driver is well suited to every use-case

Storage drivers are improving and evolving all of the time

The docker docs don't take a position either. If one doesn't want to make assumptions about how many disks a machine has (laptops, bare metal servers with one drive, 'etc) direct LVM is out.

AUFS was the original backend used by docker but is not in the mainline kernel (it is however included by Debian/Ubuntu).

Overlay is in mainline and supported as a Technology Preview by RHEL.

Additionally "Many people consider OverlayFS as the future of the Docker storage driver". It is the future proof way to go.

Overlay Dependencies

CentOS 7.2
"Only XFS is currently supported for use as a lower layer file system."
"/etc/sysconfig/docker must not contain --selinux-enabled" (for now)

With the above satisfied, to enable overlay simply:

echo "overlay" > /etc/modules-load.d/overlay.conf

And add the flag (--storage-driver=overlay) in the docker service file or DOCKER_OPTS (/etc/default/docker).

This requires a reboot, but first...

Properly configure netfilter

docker info had another complaint.

WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

This toggles whether packets traversing the bridge are forwarded to iptables. This is docker issue #24809 and could be ignored ("either /proc/sys/net/bridge/bridge-nf-call-iptables doesn't exist or is set to 0"). CentOS and most distros default this to 0.

If I was writing a choose your own adventure book this is the point I'd write that thunder rumbles in the distance, a quiet intensity.

If you follow this dead end all the way to a Kubernetes cluster you will find out that kube-proxy requires that bridged traffic passes through netfilter. So that path should absolutely exist otherwise you have a problem.

Furthermore you'll find that kube-proxy will not work properly with Weave on Centos if this isn't toggled to 1. At first everything will appear to be fine, the problem only manifests itself by kubernetes service endpoints not being routable.

To get rid of these warnings you might try:

echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables

echo 1 > /proc/sys/net/bridge/bridge-nf-call-ip6tables

This would toggle the setting but not persist after a reboot. Once again, this will cause a situation where a cluster will initially appear to work fine.

The above settings used to live in /etc/sysctl.conf the contents of which nowadays are:

# System default settings live in /usr/lib/sysctl.d/00-system.conf.
# To override those settings, enter new settings here, or in an /etc/sysctl.d/<name>.conf file

This file is sourced on every invocation of sysctl -p.

Attempting to toggle via sysctl -p gives the following error under certain conditions:

error: "net.bridge.bridge-nf-call-ip6tables" is an unknown key
error: "net.bridge.bridge-nf-call-iptables" is an unknown key

Since sysctl runs at boot there's also a very possible race condition if the bridge module hasn't loaded yet at that point. Making this a (sometimes) misleading error message.

The correct way to set this as of CentOS7:

$ cat /usr/lib/sysctl.d/00-system.conf


# Kernel sysctl configuration file
#
# For binary values, 0 is disabled, 1 is enabled.  See sysctl(8) and
# sysctl.conf(5) for more details.

# Disable netfilter on bridges.
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0

$ cat /usr/lib/sysctl.d/90-system.conf

net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1

This way systemd ensures these settings will be evaluated whenever a bridge module is loaded and the race condition is avoided.

Speaking of misleading error messages, kubernetes logs an incorrect br-netfilter warning on Centos 7:

proxier.go:205] missing br-netfilter module or unset br-nf-call-iptables; proxy may not work as intended

Stay the course, there's nothing else to toggle to make this warning go away, it is simply a false positive.

Consider disabling selinux

With Overlay as the storage backend currently you can only run with selinux on the host, a temporary limitation.

However, elsewhere, kubernetes uses a mechanism that injects special volumes into each container to expose service account tokens and with selinux turned on secrets simply don't work.

The work around is to set the security context of volume on the kubernetes host (sudo chcon -Rt svirt_sandbox_file_t /var/lib/kubelet) or set selinux to permissive mode.

Otherwise down the line kubernetes add-ons will fail or behave unpredictably. For example KubeDNS will fail to authenticate with the master and dns lookups on service endpoints will fail. (Slightly differs from the bridge netfilter disabled problem described above which results in routing by ip intermittently failing)

Since there might be other selinux permissions necessary elsewhere consider turning off selinux entirely until this is properly decided upon and documented.

Correct CNI config

Kubernetes supports CNI Network Plugins for interoperability. Setting up a network overlay requires this dependency.

Kubernetes 1.3.5 broke the cni config — as of that version it is necessary to pull in the cni release binaries into the cni bin folder.

As of Kubernetes 1.4 the flags to specify cni directories changed and documentation was added pinning the minimum cni version to 0.2 and at least the lo binary.

Other Dependencies

There's additional undocumented missing dependencies as follows:

conntrack-tools
socat
bridge-utils

AWS specific requirements & debugging

Peeking under the hood of Kubernetes on AWS you'll find:

All AWS resources are tagged with a tag named "KubernetesCluster", with a value that is the unique cluster-id. This tag is used to identify a particular 'instance' of Kubernetes, even if two clusters are deployed into the same VPC. Resources are considered to belong to the same cluster if and only if they have the same value in the tag named "KubernetesCluster".

This isn't only necessary to differentiate resources between different clusters in the same VPC but also needed for the controller to discover and manage AWS resources at all (even if it has an entire VPC to itself).

Unfortunately these tags are not filtered on in a uniform manner across different resource types.

A kubectl create -f resource.yaml successfully submitted to kubernetes might not result in expected functionality (in this case a load balancer endpoint) even when the desired resource shows as creating.... It will simply show that indefinitely instead of an error.

Since the problem doesn't bubble up to kubectl responses the only way to see that something is amiss is by carefully watching the controller log.

aws.go:2731] Error opening ingress rules for the load balancer to the instances: Multiple tagged security groups found for instance i-04bd9c4c8aa; ensure only the k8s security group is tagged

Reading the code yields:

// Returns the first security group for an instance, or nil
// We only create instances with one security group, so we don't expect multiple security groups.
// However, if there are multiple security groups, we will choose the one tagged with our cluster filter.
// Otherwise we will return an error.

In this example the kubernetes masters and minions each have a security group, both security groups are tagged with "KubernetesCluster=name". Removing the tags from the master security group resolves this problem as now the controller receives an expected response from the AWS API. It is easy to imagine many other scenarios where such conflicts might arise if the tag filtering is not consistent.

Smoke tests that simply launch and destroy a large amount of pods and resources would not catch this problem either.

Conclusion

The most difficult bugs are the ones that occur far away from their origins. Bugs that will slowly but surely degrade a cluster and yet sneak past continuous integration tests.

Additionally the target is a moving one. Minor releases of kubernetes can still have undocumented changes and undocumented dependencies.

If a critical Add-Ons fails seemingly identical clusters deployed minutes apart will have divergent behaviour. The cloud environments clusters slot into are also a source of state and therfore subtle edge case that can confuse the controller and silently prevent it from deploying things.

In short, this is a complex support matrix.

A possible way to improve things is by introducing:

A set of host OS images with minimal changes baked in as necessary for Kubernetes
- Continuously (weekly?) rebased on top of the latest official images
- As the basis for a well documented reference implementation of a custom cluster
Long running custom clusters spun up for each permutation of minor version updates (kubernetes version bump, weave, flannel, etcd, and so on)
A deterministic demo app/deployment as a comprehensive smoketest & benchmark

The community need to mix and match the multiple supported components with arbitrary necessary for custom deployments can be benefit from a set of "blessed" kubernetes-flavored host OS images and a more typical real-world artifact to check their customizations against.

demo's People

Contributors

Stargazers

Watchers

demo's Issues

CLA bot needs to show in progress status

@caniszczyk I watched the PR that @bgrant0607 just submitted, and I'm concerned that the CLAbot is not quite configured correctly.

The issue is that when creating the pull request, the status API should show in progress from the CLAbot, to let the submitter know that it's running. Then, it should change to passed or failed.

Right now, it seems to show as green and then come back as pass or fail a minute later.

Benchmark design proposal

Note: This is very tentative and is pending discussion.

Egress

HTTP Load is generated with WRK Pods, scriptable via Lua, and auto scaled by increments of 1000 rps (Each pod makes one thousand concurrent requests), up to 100 pods max.

WRK pods are pinned with the affinity mechanism to set of nodes A.

Given that this works out to under 100MB/s of traffic in aggregate -- this is not a high bar -- carefully picking the right instance type we can dial it in so this requires the predetermined amount of nodes of our choosing.

Its a good idea to have the number of nodes equal (or be a multiple of) the number of availability zones in the region the benchmark will run in. For 'us-west-2' that would be three.

Instance type selection is with the intention of picking the smallest/cheapest type that is still beefy enough to generate enough load with just those three nodes.

Ingress

Countly API pods are similarly pinned to set of nodes B. Again, as few as three nodes (with one pod per) are required. This provides redundancy but also mirrors the Egress described above and thus controls for variance in traffic between pods in different availability zones.

The autoscaling custom metric

The WRK pods report summaries with latency statistics and and failed requests.
Its possible to latch unto this error rate to provide the custom metric.

It seems at the moment this requires a bit of trickery, for example:

Pod1: 1000 requests made, 12 time outs, 31 errors
Pod2: 1000 requests made, 55 time outs, 14 errors
Pod3: 1000 requests made, 32 time outs, 55 errors

The autoscaler is actually provided a target.
Assuming we want to tolerate no more than a 10% bad requests (errors + timeouts) we'd provide a target of 100.

Based on the above the autoscaler will keep launching additional pods, the load will increase, so will the error rates and time outs, until an equilibrium is reached.

The backend (Mongo cluster)

Mongo pods are pinned to set of nodes C. These are CPU, Memory, and Disk I/O intensive. The number of these pods and nodes is fixed.

The background job (Boinc)

Boinc pods are pinned to nodes A,B but not C. They are scaled to soak up available CPU on these nodes which will otherwise be under utilized.

Consider portable local cluster

@Zilman @leecalcote We were joking about you having a cluster that could fit in your backpack, but this one would probably speed up your debug cycles versus the latency of working with the cloud.

https://hackernoon.com/diy-kubernetes-cluster-with-x86-stick-pcs-b0b6b879f8a7
https://news.ycombinator.com/item?id=12985525

I will fund this as one of your deployment targets if you're interested.

Container execution checks utility for use with InitContainers/readiness/liveness probes (proposal)

Handling initialization is typically shown with simple commands such as 'wget' or 'cat' and is rather straightforward.

However, for non trivial conditionals this can get hairy.

A contrived example

Consider an InitContainer that succeeds when a service responds with 3 DNS endpoints.
At first glance it is a simple nslookup servicename -ge 3 one liner. That is until you happen to use an image that doesn't bundle nslookup so you'd getent hosts servicename -ge 3 instead.

Writing bash one liners is suboptimal

What utilities can one safely rely on for the one-liner munging?
No sane style guide
Maintainability

In reality past the simple one liner people should (and do) reach for the scripting language of their choice. However, now you went from a tiny busybox InitContainer to a 300MB container that bundles python to avoid writing a little bash.

The executing checks do belong in the project yaml/json file instead of being baked into some one-off image on the side. Most of these checks for most projects probably fall into two dozen or some common patterns.

So I purpose a utility in the spirit of bc.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    pod.alpha.kubernetes.io/init-containers: '[
        {
            "name": "install",
            "image": "busybox",
            "command": ["k", "service", "name", "at least", "3"]
        }
]'

To be written in Go, with a small core, and extensible (so users can add custom checks via a volume).

Single command load generation (Countly, Boinc)

Currently countly goes from single command to available to ready to use after a few minutes.
We want to pepper over the manual steps left in setting it up and triggering WRK load against it.

Boinc is just a regular background job, nothing needed there.

Fluentd Use Cases

Outside Kubernetes

Systemd logs archiving to S3

For lack of somewhere better to send to at the moment we can start by simply archiving these logs, particularly kube-apiserver with increased verbosity could end up being useful.

The default is hourly which is not actionable for our short running demo. Going to try a 1 minute resolution.

Forward Output Plugin

Since we're going to have a small dedicated server (#145) to save demo runs and dashboards, we could include fluentd on it and use the forwarding plugin.

Within Kubernetes

Best way of collecting all the Kubernetes logs (from pods) via something like fluentd is under active discussion (kubernetes/kubernetes/issues/24677), still researching the way forward on this.

Resource Quality of Service in Kubernetes

Background reading:

https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler_algorithm.md
https://github.com/kubernetes/kubernetes/blob/master/docs/design/resource-qos.md

This issue is to discuss what were dealing with in the context of the demo.

Ansible Provisioner via Packer can't connect to instance

When run as:
packer build -debug

It's possible to successfully connect with the generated temporary key over plain old ssh as well as do an ansible ping (ansible all -i 52.40.138.235, -m ping --user=centos --key-file=ec2_amazon-ebs.pem).

However when packer gets to the point where it starts the ansible provisioner it fails to connect.

==> amazon-ebs: Waiting for SSH to become available...
==> amazon-ebs: Connected to SSH!
==> amazon-ebs: Provisioning with Ansible...
==> amazon-ebs: SSH proxy: serving on 127.0.0.1:60344
==> amazon-ebs: Executing Ansible: ansible-playbook /Users/Gene/Projects/k8s/cnfn/demo/Kubernetes/Bootstrap/Ansible/playbooks/setup/main.yml -i /var/folders/ws/2xpp7b5n3vj5h69xf79hqfnw0000gn/T/packer-provisioner-ansible009016913 --private-key /var/folders/ws/2xpp7b5n3vj5h69xf79hqfnw0000gn/T/ansible-key242768954
amazon-ebs:
amazon-ebs: PLAY [all] *********************************************************************
amazon-ebs:
amazon-ebs: TASK [setup] *******************************************************************
amazon-ebs: SSH proxy: accepted connection
==> amazon-ebs: authentication attempt from 127.0.0.1:60345 to 127.0.0.1:60344 as centos using none
==> amazon-ebs: unauthorized key
==> amazon-ebs: authentication attempt from 127.0.0.1:60345 to 127.0.0.1:60344 as centos using publickey
==> amazon-ebs: authentication attempt from 127.0.0.1:60345 to 127.0.0.1:60344 as centos using publickey
==> amazon-ebs: starting sftp subsystem
amazon-ebs: fatal: [default]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh", "unreachable": true}
amazon-ebs: to retry, use: --limit @/Users/Gene/Projects/k8s/cnfn/demo/Kubernetes/Bootstrap/Ansible/playbooks/setup/main.retry
amazon-ebs:
amazon-ebs: PLAY RECAP *********************************************************************
amazon-ebs: default : ok=0 changed=0 unreachable=1 failed=0
amazon-ebs:
==> amazon-ebs: shutting down the SSH proxy
==> amazon-ebs: Terminating the source AWS instance...

CPU spike when pulling big containers can kill nodes & the whole cluster

A Very Large Container can cause a huge CPU spike.

This is hard to pin point exactly, could be just docker pull working very hard, a kubelet bug, or something else.

Cloudwatch doesn't quite capture how bad this is, nodes freeze up to the point where you can't ssh into them. Everything becomes totally unresponsive, 'etc. Eventually (after 7 minutes in this case) it finally revs down and recovers. Except the Weave pods. Now the cluster is shot.

kubectl delete -f https://git.io/weave-kube, kubectl apply -f https://git.io/weave-kube does not help.

kubectl logs weave-net-sbbsm --namespace=kube-system weave-npc

..
time="2016-11-17T04:16:44Z" level=fatal msg="add pod: ipset [add weave-k?Z;25^M}|1s7P3|H9i;*;MhG 10.40.0.2] failed: ipset v6.29: Element cannot be added to the set: it's already added\n: exit status 1"

To be fair, the nodes are t2.micro and have handled everything so far. Perhaps this is their natural limit, retrying with larger instances.

Integrate CoreDNS

Need to update demo app to include CoreDNS usage.

Integrate GRPC

Need to update demo app to include GRPC usage.

Iptables/MASQUERADE support is insufficient

kubernetes/kubernetes#17084, kubernetes/kubernetes#11204, kubernetes/kubernetes#20893, kubernetes/kubernetes#15932

It appears like this is the explanation for two related but separate show stopping problems.

One problem is that kubedns eventually forwards to the outside resolver, in AWS this would be something like 172.20.0.2 -- and AWS apparently doesn't respect traffic coming from a different subnet. Since the request is originating from a pod within an overlay network with an IP of '10.x.x.x' it hangs.

The second problem is that internal resolving via kubedns works when you lookup directly against the kubedns pod. nslookup kubernetes.default <ip-of-kubednspod> works.

However routing to the kubedns service is broken. It seems for some environments and overlay settings the iptables kube-proxy writes are not quite right.

OpenTracing

As first steps it would be nice to instrument cncfdemo-cli, as it is a short & simple python script.
Appdash could be containerized and deployed unto a cluster as part of the demo to receive traces.

One open question currently is that Appdash will become available a few minutes _after_ the script is started. Hopefully its possible to patch the remote controller endpoint on the fly somehow or do some trickery.

Intel Cluster Usage -- reset in 48 hours

If you have access to one of the 20 nodes allocated to CNCF please be advised that I'm going to wipe those instances soon and save your work.

Also fess up to installing Anaconda when I wasn't looking. :)

Custom VPC domains trip up Controller

See: kubernetes/kubernetes#23371

Stumbled across this by simply using a none default DHCP Option Set in my VPC.

Get distcc compiles working

Specifically:

pump mode
reliably
instrumented

skip-tags doesn't work for handlers

https://groups.google.com/forum/#!msg/ansible-project/qXWpCJ1449E/Bh5XeCPB0MAJ

Unfortunate.

Specifically, one use case is when baking an image you might want to setup a service but stop just shy of starting it for the first time. Ansible service restart handlers really should support tags which in this instance are much more elegant than conditionals.

Boolean values in YAML lead to unexpected behaviour

kubectl label nodes <nodename> echo=true

rc.yaml:

spec:
   nodeSelector:
      echo: true
   containers:

$ kubectl create -f rc.yaml --validate
unable to decode "rc.yaml": [pos 434]: json: expect char '"' but got char 't'

The same thing happens if true is set to yes. When set to anything else like foo it succeeds with no complaints. Kind of surprising if you try yes first.

A simple build system

After a period of lots of little tweaks and modifications to nail down a bug the AMI count passed 100. That's getting a bit out of hand.

The workflow is to cd into the ansible directory of this repository, do packer build packer.json, wait 15 minutes for the process to run through, manually alter which AMI is referenced in the bootstrap scripts, deploy a new cluster, and manually poke around testing things out -- sometimes to only discover some minor thing is broken too late for comfort.

We're past the point of needing proper tests. And its beneficial at this point to automate some things to speed the process up.

So to get this going a simple build system is necessary. The general idea is to setup github hooks to kick off packer builds instead of doing it from a laptop. The 15 minutes will hopefully be cut down a bit, but even better is having a record of it happening.

An implementation detail

It seems wasteful to always keep a server up to listen for build hooks so the hook should instead trigger a lambda that will either forward it to the build server or notice its not up and turn it back on (via Autoscaling group of size 1 with scheduling to scale down to 0 when commits usually don't happen).

Getting the logs back out again

Remote Journal Logging - "Systemd journal can be configured to forward events to a remote server. Entries are forwarded including full metadata, and are stored in normal journal files, identically to locally generated logs. This can be used as an alternative or in addition to existing log forwarding solutions."

Where the logs are forwarded to and how they are persisted is being explored.

Benefits

Speed up debugging
Significantly reduce the need to SSH into individual members of a cluster
Easy reference and sharing of logs
Possibly write smoke tests against the logs

Intermittent responses for Kubernetes service endpoints (postmortem)

Follow up to #63.

Beginning of Problems

At some point in time a known-good deployment stopped succeeding on newly created clusters. This was caused by several disparate issues across several versions/configurations/components.

Init containers would not progress because service availability checks would fail
A service would appear to exist (kubectl get svc) and point at pods with correct endpoints (kubectl describe service)
Attaching to pods directly for inspection would show them operating as expected
Sometimes parts would succeed, but not uniformly and with no clear pattern

The first step to check if a service is working correctly is actually a simple DNS check (nslookup service). By chance, this would often appear to be functioning as expected indicating the problem must be elsewhere (not necessarily with kubernetes).

However, not to bury the lead: running nslookup on a loop would later expose that it was timing out sporadically. That is the sort of thing that makes a bug sinister as it misdirects debugging efforts away from the problem.

Known KubeDNS Issues Encountered

Secrets volume & SELinux permissions

SELinux context was missing 'svirt_sandbox_file_t' on the secretes volume and therefore from the perspective of the KubeDNS pod /var/run/secrets/kubernetes.io/serviceaccount/ was mangled and it couldn't in turn use that to connect to the master.
Secrets volume got stale

The kube-controller is responsible for injecting the secrets volume into pods and keeping it up to date. There were/are known bugs where it would fail to do that. As a result KubeDNS would mysteriously stop working because its tokens to connect to the master had grown stale. (This sort of thing: kubernetes/kubernetes#24928)
Typo

official skydns-rc.yaml had a typo at some point with --domain= missing the trailing dot.
Scalability

It is now recommended to scale KubeDNS pods proportionally to number of nodes in a cluster.

These problems would crop up and get resolved yet errors would stubbornly persist.

kubectl logs $(kubectl --namespace=kube-system get pods | tail -n1 | cut -d' ' -f1) --namespace=kube-system --container kubedns

I0829 20:19:21.696107       1 server.go:94] Using https://10.16.0.1:443 for kubernetes master, kubernetes API: <nil>
I0829 20:19:21.699491       1 server.go:99] v1.4.0-alpha.2.1652+c69e3d32a29cfa-dirty
I0829 20:19:21.699518       1 server.go:101] FLAG: --alsologtostderr="false"
I0829 20:19:21.699536       1 server.go:101] FLAG: --dns-port="10053"
I0829 20:19:21.699548       1 server.go:101] FLAG: --domain="cluster.local."
I0829 20:19:21.699554       1 server.go:101] FLAG: --federations=""
I0829 20:19:21.699560       1 server.go:101] FLAG: --healthz-port="8081"
I0829 20:19:21.699565       1 server.go:101] FLAG: --kube-master-url=""
I0829 20:19:21.699571       1 server.go:101] FLAG: --kubecfg-file=""
I0829 20:19:21.699577       1 server.go:101] FLAG: --log-backtrace-at=":0"
I0829 20:19:21.699584       1 server.go:101] FLAG: --log-dir=""
I0829 20:19:21.699600       1 server.go:101] FLAG: --log-flush-frequency="5s"
I0829 20:19:21.699607       1 server.go:101] FLAG: --logtostderr="true"
I0829 20:19:21.699613       1 server.go:101] FLAG: --stderrthreshold="2"
I0829 20:19:21.699618       1 server.go:101] FLAG: --v="0"
I0829 20:19:21.699622       1 server.go:101] FLAG: --version="false"
I0829 20:19:21.699629       1 server.go:101] FLAG: --vmodule=""
I0829 20:19:21.699681       1 server.go:138] Starting SkyDNS server. Listening on port:10053
I0829 20:19:21.699729       1 server.go:145] skydns: metrics enabled on : /metrics:
I0829 20:19:21.699751       1 dns.go:167] Waiting for service: default/kubernetes
I0829 20:19:21.700458       1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0829 20:19:21.700474       1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0829 20:19:26.691900       1 logs.go:41] skydns: failure to forward request "read udp 10.32.0.2:49468->172.20.0.2:53: i/o timeout"

Known Kubernetes Networking Issues Encountered

Initial Checks

Kubernetes imposes the following fundamental requirements on any networking implementation:

all containers can communicate with all other containers without NAT

all nodes can communicate with all containers (and vice-versa) without NAT

the IP that a container sees itself as is the same IP that others see it as

_{- Networking in Kubernetes}

In other words, to make sure networking is not seriously broken/misconfigured check:

Pods are being created / destroyed
Pods are able to ping each other

At first blush these were looking fine, but pod creation was sluggish (30-60 seconds), and that is a red flag.

Missing Dependencies

As described in #62, at some version CNI folder started missing binaries.

More undocumented dependencies (#64) were found from staring at logs and noting weirdness.
The real important ones are (conntrack-tools, socat, bridge-utils), these things are now being pinned down upstream.

The errors were time consuming to understand because often their phrasing would leave something to be desired. Unfortunately there's at least one known false-positive warning (kubernetes/kubernetes#23385).

Cluster CIDR overlaps

--cluster-cidr="": CIDR Range for Pods in cluster.
--service-cluster-ip-range="": CIDR Range for Services in cluster.

In my case services got a /16 starting on 10.0.0.0, the cluster-cidr got a 16 on 10.244.0.0.
The service cidr is routable because kube-proxy is constantly writing iptable rules on every minion.

For Weave in particular --ipalloc-range needs to be passed to exactly match what's given to the Kubernetes cluster-cidr.

Whatever your network overlay, it must not clobber the service range!

Iptables masquerade conflicts

Flannel

If using Flannel be sure to follow the newly documented instructions:
DOCKER_OPTS="--iptables=false --ip-masq=false"

Kube-proxy makes extensive use of masquerading rules, similar to an overlay clobbering the service range, another component (like the docker daemon itself) mucking about with masq rules will cause unexpected behavior.

Weave

Weave was originally erronously started with --docker-endpoint=unix:///var/run/weave/weave.sock which similarly caused unexpected behavior. This flag is extraneous and has to be omitted when used with CNI.

Final Configuration

Image

Centos7 source_ami: ami-bec022de

Dependencies

SELinux disabled.

Yum installed:

docker
etcd
conntrack-tools
socat
bridge-utils

kubernetes_version: 1.4.0-alpha.3
(b44b716965db2d54c8c7dfcdbcb1d54792ab8559)

weave_version: 1.6.1

1 Master (172.20.0.78)

Gist of journalctl output shows it boots fine, docker, etcd, kube-apiserver, scheduler, and controller all start. Minion registers successfully.

$ kubectl  get componentstatuses

NAME                 STATUS    MESSAGE              ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health": "true"}

$ kubectl get nodes 

NAME                                        STATUS    AGE
ip-172-20-0-18.us-west-2.compute.internal   Ready     1m

1 minion (172.20.0.18)

$ kubectl run -i --tty --image concourse/busyboxplus:curl dns-test42-$RANDOM --restart=Never /bin/sh

Pod created (not sluggishly). Multiple pods can ping each other.

Weave

Weave and weaveproxy are up and running just fine.

$ weave status

Version: 1.6.0 (version 1.6.1 available - please upgrade!)

        Service: router
       Protocol: weave 1..2
           Name: ce:1a:4b:b0:07:6d(ip-172-20-0-18)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 0
    Connections: 0
          Peers: 1
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.244.0.0/16
  DefaultSubnet: 10.244.0.0/16

        Service: proxy
        Address: unix:///var/run/weave/weave.sock

$ weave status ipam

ce:1a:4b:b0:07:6d(ip-172-20-0-18)        65536 IPs (100.0% of total)

Conclusion

Kubernetes is rapidly evolving with many open issues -- there are now efforts upstream to pin down and document the dependencies along with making errors and warnings more user-friendly in the logs.

As future versions become less opaque knowing which open issue is relevant to your setup will become easier. Along with whether an obvious dependency is missing and what a good setup looks like.

The nominal sanity check command that currently exists (kubectl get componentstatuses) does not go far enough. It might show everything is healthy. Pods might be successfully created. Services might work.

And yet these can all be misleading as a cluster may still not be entirely healthy.

A useful test I found in the official repo simply tests connectivity (and authentication) to the master. Sluggishness is not tested and sluggishness it turns out is a red flag.

In fact, there's an entire folder of these, but they are not well documented as far as I can tell.

I believe a smoke test that can deployed against any running cluster and run through a suite of checks and benchmarks (to take into account unexpectedly poor performance) would significantly improve the debugging experience.

Boto vs Terraform

Boto3 is the Amazon Web Services (AWS) SDK for Python.

Terraform is a higher level abstraction that provides additional features like multi-cloud support, complex changesets, and so on.

Some people have embraced Terraform wholeheartedly already, however there are cons.

While multi-cloud support is an excellent feature for the people who require it - it is not currently in scope for this demo. This might change, at which point the cost-benefit of using Terraform will have to be reevaluated.

The multi-cloud support is a leaky abstraction, going up the ladder of abstraction in this case means learning yet another configuration syntax and introducing yet another dependency on yet another tool.

The union set of users who can use Terraform and not be familiar with the underlying AWS API's that Boto exposes is approximately zero. Furthermore, its worthwhile to consider that virtually all users of AWS are allowed to use the official SDK (and probably already have it configured) but not all are allowed (or capable, at least in a timely manner) of using Terraform.

There's something to be said for avoiding the situation of: "to try out our demo, first install and understand a separate project".

Finally, Terraform predates Boto3, which is significantly improved and simplified, and sprinkles in some of the higher levelness.

As a result, for our limited use case, one can get most of the pros of terraform sans the cons.

Execution Plans in a few lines of Python

Lets define a sequence of named tuples.

The tuple consists of an object, method name, and arguments, respectively.

bootstrap = [(EC2, 'create_key_pair', {'KeyName': keyname}),
             (IAM, 'create_instance_profile', {'InstanceProfileName': instanceprofile}),
             (IPE, 'wait', {'InstanceProfileName': instanceprofile}),
             (IAM, 'create_role', {'RoleName': rolename, 'AssumeRolePolicyDocument': json.dumps(TrustedPolicy)}),
             (IAM, 'add_role_to_instance_profile', {'RoleName': rolename, 'InstanceProfileName': instanceprofile}),
             (IAM, 'attach_role_policy', {'RoleName': rolename, 'PolicyArn': policyarn})]

There's absolutely no mystery here.

import boto3
EC2 = boto3.resource('ec2')
IAM = boto3.resource('iam')

You can see the method names are lifted directly from the Boto3 API Documentation along with the arguments which are typically json dictionaries.

One can simply copy-paste from the docs the json blobs and reference back and forth to understand exactly what is going on.

If we print out just the method column we get something that scans beautifully:

create_key_pair
create_instance_profile
wait
create_role
..
create_launch_configuration
create_auto_scaling_group

Which just reads as plain English and such a laundry list can be found in any number of tutorials and blog posts that enumerate the steps for creating an auto scaling group.

There are no other complex steps, the only thing modified after these things are provisioned is the scaling factor of the group.

There are no complex dependencies or teardowns as we create a dedicated vpc and blow it away completely per cluster each time -- we never edit an existing deployment as you would in a production environment and thus the raison d'etre of terraform and other such tools - complex changesets - is not relevant.

In short, this seems like a good sweet spot to ramp users unto a kubernetes cluster deployment process without unnecessary indirection steps.

It's entirely a Terraform recipe will be added future and primarily used, but the vanilla way should definitely come first and be supported.

Failed job garbage collection

kubectl get pods -a

This list becomes a bit messy sometimes.

Leaving a note here to consider tweaking clean ups based on these settings:
http://kubernetes.io/docs/admin/garbage-collection/

Low priority.

Repeated creation/deletion of resources breaks cluster

A successful run from bootstrapping a cluster to provisioning its needed resources all in one go has been happening for some time now.

With that being said, the development experience is one of tinkering.

kubectl create -f
kubectl delete -f
kubectl create -f

And so on. Sometimes mysterious bugs pop up, connectivity issues, resources not addressable, time is spent chasing it.. a full cluster shutdown and trying again from scratch often ends up fixing everything.

The problems are usually around Kubernetes services getting 'confused' or 'sticky'. Additionally, sometimes you see a pod get stuck in 'Terminating' or an inability to attach to running pods because the docker process on the Node crashed.

So in short, small seemingly harmless actions can currently bring down a whole cluster.

Missing Features for Complex Deployments

kubernetes/kubernetes#1899 - Allow users to wait for conditions from kubectl and using the API
kubernetes/kubernetes#1542 - Support master election (think, building blocks for quorum based clusters like Mongo)

https://github.com/cncf/demo/blob/master/Kubernetes/Docs/Advanced.md

Python2 vs Python3 for client side scripts

As part of the demo there will be several scripts that can run on the client side.

Anything from a tiny utility that exposes a k8s service endpoint via route53 as a nice human friendly subdomain to a full fledged deployment automation script that rolls out and sets up the various resources that run on the cluster.

One way this question could be made irrelevant is by simply running all of that out of a so called SideCar container. However there might be something to be said for letting the user run and play with it natively.

Opinions are welcome whether its acceptable to have python3 as a requirement.

boinc unexpected behavior -- warnings, daily quota

27-Jul-2016 18:22:43 [---] Resuming computation
27-Jul-2016 18:22:46 [http://www.worldcommunitygrid.org/] Master file download succeeded
27-Jul-2016 18:22:51 [http://www.worldcommunitygrid.org/] Sending scheduler request: Project initialization.
27-Jul-2016 18:22:51 [http://www.worldcommunitygrid.org/] Requesting new tasks for CPU
27-Jul-2016 18:22:55 [World Community Grid] Scheduler request completed: got 0 new tasks
27-Jul-2016 18:22:55 [World Community Grid] No tasks sent
27-Jul-2016 18:22:55 [World Community Grid] No tasks are available for OpenZika
27-Jul-2016 18:22:55 [World Community Grid] No tasks are available for the applications you have selected.
27-Jul-2016 18:22:55 [World Community Grid] This computer has finished a daily quota of 5 tasks

That's not quite expected, ran into it because I kept iterating from my machine. Apparently if you start and stop a lot the client gets a temporary ban.

Also:

dir_open: Could not open directory 'slots' from '/var/lib/boinc-client'.

The only evidence of this issue on google suggests its a permissions thing. Also had this with the projects directory and as a result I'm doing:

mkdir -p /var/lib/boinc-client/projects/www.worldcommunitygrid.org && chown -R boinc:boinc /var/lib/boinc-client

boinc client needs mice device

https://setiathome.berkeley.edu/forum_thread.php?id=79537

/dev/input/mice is being used by boinc for entropy or something -- assuming X by boinc maintainers.. is a problem.

Grafana output discussion

Row By Row

At the top there are two gauge views with spark charts in their background.
Left gauge is memory and right is CPU and totals underneath. This is meant as a quick cluster wide resource overview on the node level.
Pod CPU shows the avg and current cpu utilization percentage per pod.
System Pod CPU, this is all the "system" pods like kubedns, weave, node-exporter, it should under normal conditions be flat and low so it is somewhat greyed out.
Pods Memory (MB)
System Pods Memory (MB)
Pods Network I/O - shows ingress and egress

Influxdb-grafana addon

Background

Influxdb is a timeseries data store.
Grafana is a webapp that visualizes time series data and allows the creation of custom dashboards. It has support for several data sources, including influxdb (and more recently, Prometheus).

There are several slightly different versions of this pairing as a Kubernetes AddOn:

Neither is quite right as these examples combine what should be two separate deployments or replication controllers into one.

As a result the following race condition occurs:

Grafana container starts and is configured with a data source pointing to the influxdb service.
It fails to connect as the Influx pod is either not started or not ready yet.
As a result the dashboards are all blank with a warning marker.
The influxdb container starts.

At this point you can visit the Grafana UI and in the data source settings simply hit 'test connection'.

This forces a refresh and now data shows up in grafana and things work as expected.

1.3.5 missing CNI binaries

So up until now only having weave binaries in /opt/cni/bin/ worked and was documented as right way.

With 1.3.5 it silently stops working and after much head scratching and furious googling the only mention of this problem is here: kubernetes/kubernetes#30681

Turns out one has to manually pull in 0.3.0 of https://github.com/containernetworking/cni/releases into /opt/cni/bin. Repeating that info here for visibility to spare the next fellow who gets stuck on that.

Grafana roundtrip json export/import bug

Unfortunately I can reproduce this issue reliably: grafana/grafana#2816

Create dashboard in Grafana 3
Export json file

Import json file manually (via Grafana web UI) works.
Import via API does not despite making the exact same API calls.

In other words, there's some additional json munging required, can't submit dashboard file as-is.
This is an inconvenience because now a hackish post processing step is necessary for demo outputs.

Note: the really odd thing is that backwards compatibility with Grafana 2 is good so dashboards exported from that are imported correctly both ways.

Docker builds intermittently failing due to apt cache issues

This is a pretty well known problem.

Docker caches aggressively so a common first line in your Dockerfile like apt update will not run each time and eventually the mirror list will become stale and builds will fail.

Possible workaround #1

docker build --no-cache

Possible workaround #2

RUN apt-get clean && apt update

While the above helps with stale mirrors it does not help with unresponsive/slow/broken ones.

FROM debian:stable

RUN apt-get clean && apt update && apt install -y
RUN apt install -y kernel-package

This is taking a very long time today but eventually completes despite appearing to be frozen. Building large containers (this one ends up being 1GB in size, ouch) from a laptop is not a good workflow.

Azure support

Centos 7 kdump service failed

systemctl --failed
  UNIT                         LOAD   ACTIVE SUB    DESCRIPTION
● docker-storage-setup.service loaded failed failed Docker Storage Setup
● kdump.service                loaded failed failed Crash recovery kernel arming
● network.service              loaded failed failed LSB: Bring up/down networking

systemctl status kdump.service
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2016-10-24 18:58:21 UTC; 49s ago
  Process: 812 ExecStart=/usr/bin/kdumpctl start (code=exited, status=1/FAILURE)
 Main PID: 812 (code=exited, status=1/FAILURE)

Oct 24 18:58:21 ip-172-31-18-67.us-west-2.compute.internal systemd[1]: Starting Crash recovery kernel arming...
Oct 24 18:58:21 ip-172-31-18-67.us-west-2.compute.internal kdumpctl[812]: No memory reserved for crash kernel.
Oct 24 18:58:21 ip-172-31-18-67.us-west-2.compute.internal kdumpctl[812]: Starting kdump: [FAILED]
Oct 24 18:58:21 ip-172-31-18-67.us-west-2.compute.internal systemd[1]: kdump.service: main process exited, code=exited, status=1/FAILURE
Oct 24 18:58:21 ip-172-31-18-67.us-west-2.compute.internal systemd[1]: Failed to start Crash recovery kernel arming.
Oct 24 18:58:21 ip-172-31-18-67.us-west-2.compute.internal systemd[1]: Unit kdump.service entered failed state.
Oct 24 18:58:21 ip-172-31-18-67.us-west-2.compute.internal systemd[1]: kdump.service failed.

Just started happening, there's a somewhat newer centos7 AMI so going to that.

Kubelet failed to detect running docker process when name is not `docker`

v1.2 bug: #26259

This bit of indirection was surprising to find so I'm keeping a note here for reference. Should anybody run into this the workaround in the issue linked above works.

Seems to be resolved in 1.3.

Odd little Jinja template / Yaml bug

Template file: demo/Kubernetes/API/example.yaml.j2#L81

You can combine multiple yaml files into one with the '---' document separator. Standard yaml fare, common pattern with kuberentes deployment yaml files.

I've jinja templated this one and for some reason if I don't end with a final (and it seems to my eyes unnecessary) terminating '---' in a block, the generated yaml becomes a bit weird.

It converts all the documents during the loop, except the last document (the "Job" one in this instance), that somehow gets clobbered so there's only one of it, from the last iteration only.

The '---' is a hackish thing and now the associated code that uses the generated yaml needs to make sure to throw away empty docs.

Demo output (version 0.1) - Show initial text results, images saved in folder

Save the grafana png's as regular files instead of base64 json, dump in folder.

Text results, finally something we know to want to pull:

Time from 'cluster ready' until distcc finished the compile
CPU utilization percentages per node throughout this time
Number of requests Countly handled meanwhile

ConfigMap backed volumes mounted as root

Ref: kubernetes/kubernetes#2630, kubernetes/kubernetes#11319.

So lets say you have some configuration files and you throw them into a ConfigMap.
The pod you spin up mounts a volume and these files happily appear in, for example: /etc/config.

Great. Except that volume was mounted as root and your app demands different permissions to read those files.

The workaround suggested so far is to have a wrapper script do chown'ing -- clearly hackish.

Kick off distcc on linux stable commits

Poll https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/ and pull on change

More undocumented missing dependencies

kubernetes/kubernetes#26093 is just a treasure trove I wish I'd seen before!

I've been finding and adding packages in one at a time, for example:

[Error configuring cbr0: exec: "brctl": executable file not found in $PATH]

That's because brctl is part of bridge-utils, something you have to pull in. And so on. This list was sorely needed.

Benchmarking time to deploy

Empirically, it'd be great to have a measurement of deployment time (or goal for) deployment of the demo project. In other words, identify how many minutes it typically takes someone to deploy k8s, prometheus and Countly using the cncfdemo.

Proper Job fields via the REST API

kubernetes/kubernetes#23599

So for historical reference, in your JSON:

            "spec": {
                    "autoSelector": true,
                    ..
             }

And also you at the top you must:

"apiVersion": "extensions/v1beta1"

Prometheus resource usage "observer effect"

Prometheus has been performing admirably with the demo for a while now - the amount of points written to it is relatively small despite the varied workload. So this was as expected.

Turns out according to prometheus/prometheus#455 the amount of resources it uses is not just bound by how much is written into it but also how much it is queried. For instance, if you open Grafana in a dozen tabs you can see memory starting to climb (it's a heavy dashboard).

The demo also recently added a sidecar that logs info from Prometheus to a cncfdemo backend, this increased the amount of resources used -- obvious in retrospect.

Finally, until now prometheus was just deployed as a regular pod in a 'monitoring' namespace, so it would end up on a random node. Including memory constrained nodes (the demo overloads some nodes by design). This causes a sort of observer effect and occasionally skews the results in a pronounced and strange way.

The obvious conclusion is to pin Prometheus and other crucial infra pods to some reserved nodes with plenty of headroom.

Doc: Two links to graphics in README.md broken

Update: actually three graphics links

Under Three Groups:
- Broken link 1 should be https://github.com/cncf/demo/blob/master/docs/arch.png
- Broken link 2 should be https://github.com/cncf/demo/blob/master/docs/k8s-simpler.png
Under Software-defined networking for minions:
- Broken link 3 should be https://github.com/cncf/demo/blob/master/docs/sdn.png

Host the grafana snapshots on a dedicated server to the side

Integrate LinkerD

Need to update demo app to include LinkerD usage.

Kubernetes AWS problems with multiple security groups due to tags

kubernetes/kubernetes#23339, kubernetes/kubernetes#26787

The Kubernetes Controller manages AWS resources by filtering on aws resource tags like KubernetesCluster:ClusterName. Unfortunately it does this inconsistently for different things.

8527    2292 log_handler.go:33] AWS request: elasticloadbalancing DescribeLoadBalancers
3961    2292 aws_loadbalancer.go:191] Deleting removed load balancer listeners
4035    2292 log_handler.go:33] AWS request: elasticloadbalancing DeleteLoadBalancerListeners
1501    2292 aws_loadbalancer.go:203] Creating added load balancer listeners
1592    2292 log_handler.go:33] AWS request: elasticloadbalancing CreateLoadBalancerListeners
3129    2292 log_handler.go:33] AWS request: elasticloadbalancing DescribeLoadBalancerAttributes
3214    2292 log_handler.go:33] AWS request: elasticloadbalancing ModifyLoadBalancerAttributes
4591    2292 log_handler.go:33] AWS request: elasticloadbalancing DescribeLoadBalancers
9882    2292 log_handler.go:33] AWS request: ec2 DescribeSecurityGroups
1322    2292 log_handler.go:33] AWS request: ec2 DescribeSecurityGroups
8421    2292 aws.go:2731] Error opening ingress rules for the load balancer to the instances: Multiple tagged security groups found for instance i-04bd9c4c8aa; ensure only the k8s security group is tagged
8469    2292 servicecontroller.go:754] Failed to process service. Retrying in 5m0s: Failed to create load balancer for service default/pushgateway: Mutiple tagged security groups found for instance i-04bd9c4c8aa36270e; ensure only the k8s security group is tagged
8480    2292 servicecontroller.go:724] Finished syncing service "default/pushgateway" (419.263237ms)
lines 201-224

https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L2783

// Returns the first security group for an instance, or nil
// We only create instances with one security group, so we don't expect multiple security groups.
// However, if there are multiple security groups, we will choose the one tagged with our cluster filter.
// Otherwise we will return an error.

The security groups in my case are:

k8s-minions-cncfdemo, k8s-masters-cncfdemo

They are both tagged with the cluster filter. Not expecting multiple security groups seems like a wrong (not to mentioned undocumented!) assumption.

Bit of a head scratcher.

Setting up a Bastion/Jump Server

Mosh

"Remote terminal application that allows roaming, supports intermittent connectivity"

The build server (#98) should greatly speed up things even for very stable connections because packer running within AWS will have by definition a tiny fraction of the latency than running from a laptop.

A Jump server with Mosh should hopefully be of similar benefit for flakey wifi.

Tmux

The build process is being altered to reduce the need to SSH into cluster instances, but sometimes that is inescapable. When you bring a lot of clusters up and down that becomes very tedious so its a lot easier to manage sessions with a multiplexer like tmux.

A nice potential bonus is collaboration on a session and some minor security benefits (only whitelist some traffic between the cluster(s) and the Bastion instance instead of to the entire world).

GCP support

intermittent kubedns responses for kubernetes service endpoints (1.3.5)

kubernetes/kubernetes#28497

Tl;dr on this particular saga -- multiple people with various types of environments/clusters experience kubedns connectivity issues.

What this means is that if you lookup a service internal to the cluster it sometimes doesn't resolve. This is core functionality and a show stopper. Reproducible but currently we don't understand why (I've suggested MountVolumes which is supposed to be a background loop is blocking somehow, even captured it on video).

cncf / demo Goto Github PK

demo's Introduction

CNCF Technologies Demonstration

Beta Preview

Table of Contents

Technologies

Summary of Sample Applications

Supported Deployments

CNCF Community Cluster

An Open Commitment

Disclaimer

Quick Start Guide (back to TOC)

1. Install cncfdemo

2. Create Cluster

3. Run Demo

The cncfdemo command shadows and complements the official Kubectl binary.

Complex, Scriptable Kubernetes Deployments & Jinja Templating

Behind the scenes

Future Plans

Architecture (back to TOC)

Image based Kubernetes deployments

Systemd drop-in files

Other useful settings cloud-init can override

Kubernetes Architecture

Three Groups

Cluster bootstrap via DNS discovery

Making the bookkeeping orthogonal to the deployment process

Software-defined networking for minions

Enabling CNI

Weave Quorum

Details of Sample Applications

Countly

Configuration Files

Deep Dive

Notes on Containerizing Apps

Picking a base image

Single process per container or... not

Resolving the PID 1 problem

Customizing Countly for Kubernetes

Example service

Decomposing apps into microservices for Kubernetes

Separation of concerns

InitContainers

Postmortem & Suggestions

Abstract

Picking a host operating system

CentOS 7

Default Docker is strongly discouraged for production use

The warning

What docker-storage-setup is trying (and failing) to do

Why this is a problem

To avoid these pitfalls, carefully select the storage driver

Overlay Dependencies

Properly configure netfilter

Consider disabling selinux

Correct CNI config

Other Dependencies

AWS specific requirements & debugging

Conclusion

demo's People

Contributors

Stargazers

Watchers

Forkers

demo's Issues

Egress

Ingress

The autoscaling custom metric

The backend (Mongo cluster)

The background job (Boinc)

A contrived example

Writing bash one liners is suboptimal

Outside Kubernetes

Systemd logs archiving to S3

Forward Output Plugin

Within Kubernetes

An implementation detail

Getting the logs back out again

Benefits

Follow up to #63.

Quick Start Guide _{^{(back to TOC)}}

The `cncfdemo` command shadows and complements the official Kubectl binary.

Architecture _{^{(back to TOC)}}