wardroom's Introduction

wardroom

Wardroom provides tooling that helps simplify the deployment of a Kubernetes cluster. More specifically, Wardroom provides the following functionality:

  • Image Building: Building of Kubernetes-ready base operating system images using Packer and Ansible.
  • Deployment Orchestration: Ansible-based orchestration to deploy highly-available Kubernetes clusters using kubeadm.

Both use cases share a common set of Ansible roles that can be found in the ansible directory.

Image Building

Wardroom leverages Packer to build Kubernetes-ready golden images across a wide variety of operating systems and image formats. During the build phase, Wardroom uses Ansible to configure the base operating system and produce the golden image.

This functionality is used to create base images for the Heptio aws-quickstart.

Supported Image Formats

  • AMI

Supported Operating Systems

  • Ubuntu 16.04 (Xenial)
  • Ubuntu 18.04 (Bionic)
  • CentOS 7

Deployment Orchestration

The swizzle directory contains an Ansible playbook that can be used to orchestrate the deployment of a Kubernetes cluster using kubeadm.

Documentation

Documentation and usage information can be found in the docs directory.

Contributing

See our contributing guidelines and our code of conduct. Contributions are welcome from all.

Development

Vagrant may be used to test Ansible playbook development locally. In this scenario, Vagrant uses the ansible provisioner to configure the resulting operating system image. To test all supported operating systems simultaneously:

vagrant up

You may also selectively test a single operating system:

vagrant up [xenial|bionic|centos7]

To enable verbose Ansible logging, set the WARDROOM_DEBUG environment variable to 'vvvv'.
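
For example, a full provisioning pass with maximum Ansible verbosity:

WARDROOM_DEBUG=vvvv vagrant up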

The default Vagrant provider is VirtualBox, but other providers are possible by way of the vagrant-mutate plugin.

wardroom's People

Contributors

alexbrand, chuckha, craigtracey, detiber, johnharris85, johnschnake, jpweber, lander2k2, liztio, randomvariable, ryanchapple, scottslowe, stevesloka, timothysc, vincepri


wardroom's Issues

Enumerate necessary permissions

We should enumerate the permissions necessary for the Packer builds, per cloud provider. It is frustrating for users when this is unclear, and documenting it would make for a much better user experience.

AWS region now hard-coded in packer.json

Commit cab02b6 (part of PR #121) hardcodes the "aws_region" variable in packer/packer.json, setting the value to "us-east-1". This causes the instructions in packer/README.md for building AWS images to break for regions other than us-east-1.

It looks like there are (at least) three potential fixes:

  1. Update packer/README.md to instruct users to add -var aws_region=<region> when building AMIs in regions other than us-east-1 (see the example below).
  2. Update packer/packer.json to remove the hard-coded value for the "aws_region" variable.
  3. Update the AWS region variable files (aws-us-east-1.json and aws-us-west-2.json) to set the "aws_region" variable for that specific region.
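
For illustration, option 1 would amount to documenting a command along these lines (assuming builds are invoked from the packer directory; the exact invocation lives in packer/README.md):

packer build -var aws_region=us-west-2 packer.json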

If you let me know which approach you prefer for the project, I'm happy to craft a PR implementing it. Thanks!

etcd source url hardcoded

In corporate environments, GitHub may not be directly reachable. Currently the etcd URL is hardcoded as a role variable, which does not appear to be overridable by host/group vars set elsewhere. The group name "etcd" is also hardcoded, so users with differently named hosts who are just consuming the roles cannot override it either.
Moving those values to etcd/defaults/main.yml seems to allow users to override them as needed.
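
A minimal sketch of what those defaults might look like; the variable names here (etcd_download_url, etcd_group_name) are assumptions, not necessarily the role's actual names:

# etcd/defaults/main.yml (sketch; variable names are assumptions)
# Role defaults have the lowest variable precedence in Ansible,
# so host/group vars set elsewhere will override them.
etcd_download_url: "https://github.com/coreos/etcd/releases/download"
etcd_group_name: etcd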

@craigtracey

Fix deprecation warnings

There are a couple of places where we are using deprecated Ansible features/instructions/commands; the fix is sketched after the log excerpts below.

TASK [common : install baseline dependencies] **********************************************************************************************************************************************************
[DEPRECATION WARNING]: Invoking "apt" only once while using a loop via squash_actions is deprecated. Instead of using a loop to supply multiple items and specifying `name: {{ item }}`, please use
`name: '{{ common_debs }}'` and remove the loop. This feature will be removed in version 2.11. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
TASK [docker : install docker] *************************************************************************************************************************************************************************
[DEPRECATION WARNING]: Invoking "apt" only once while using a loop via squash_actions is deprecated. Instead of using a loop to supply multiple items and specifying `name: {{ item }}`, please use
`name: ['docker-ce={{ docker_debian_version }}']` and remove the loop. This feature will be removed in version 2.11. Deprecation warnings can be disabled by setting deprecation_warnings=False in
ansible.cfg.
[DEPRECATION WARNING]: Invoking "apt" only once while using a loop via squash_actions is deprecated. Instead of using a loop to supply multiple items and specifying `name: {{ item }}`, please use
`name: ['docker-ce={{ docker_debian_version }}']` and remove the loop. This feature will be removed in version 2.11. Deprecation warnings can be disabled by setting deprecation_warnings=False in
ansible.cfg.
TASK [docker : add docker dependencies] ****************************************************************************************************************************************************************
[DEPRECATION WARNING]: Invoking "yum" only once while using a loop via squash_actions is deprecated. Instead of using a loop to supply multiple items and specifying `name: {{ item }}`, please use
`name: ['device-mapper-persistent-data', 'lvm2']` and remove the loop. This feature will be removed in version 2.11. Deprecation warnings can be disabled by setting deprecation_warnings=False in
ansible.cfg
TASK [kubernetes : install kubernetes packages] ********************************************************************************************************************************************************
[DEPRECATION WARNING]: Invoking "apt" only once while using a loop via squash_actions is deprecated. Instead of using a loop to supply multiple items and specifying `name: {{ item }}`, please use
`name: ["kubelet={{ kubernetes_version | kube_platform_version('debian') }}", "kubeadm={{ kubernetes_version | kube_platform_version('debian') }}", "kubectl={{ kubernetes_version |
kube_platform_version('debian') }}", "kubernetes-cni={{kubernetes_cni_version | kube_platform_version('debian') }}"]` and remove the loop. This feature will be removed in version 2.11. Deprecation
warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
TASK [kubernetes : install the kubernetes yum packages] ************************************************************************************************************************************************
[DEPRECATION WARNING]: Invoking "yum" only once while using a loop via squash_actions is deprecated. Instead of using a loop to supply multiple items and specifying `name: {{ item }}`, please use
`name: ["kubelet-{{ kubernetes_version | kube_platform_version('redhat') }}", "kubeadm-{{ kubernetes_version | kube_platform_version('redhat') }}", "kubectl-{{ kubernetes_version |
kube_platform_version('redhat') }}", "kubernetes-cni-{{kubernetes_cni_version | kube_platform_version('redhat')}}"]` and remove the loop. This feature will be removed in version 2.11. Deprecation
warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
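
The fix is mechanical: pass the whole list to the module's name parameter instead of looping, exactly as the warnings suggest. A before/after sketch for the first apt case:

# before: one apt invocation per item via squash_actions (deprecated)
- name: install baseline dependencies
  apt:
    name: "{{ item }}"
    state: present
  with_items: "{{ common_debs }}"

# after: a single apt invocation with the whole list
- name: install baseline dependencies
  apt:
    name: "{{ common_debs }}"
    state: present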

Remove swap from fstab

Currently the code simply disables swap at runtime. We should also remove any swap entries from /etc/fstab so the change survives a reboot.

The runtime disabling currently lives in ansible/roles/kubernetes-common/tasks/main.yml.
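
A sketch of the extra task, assuming we comment entries out rather than delete them (the exact approach is up for discussion):

# comment out any active swap entries so they do not survive a reboot
- name: remove swap entries from /etc/fstab
  replace:
    path: /etc/fstab
    regexp: '^([^#].*\s+swap\s+.*)$'
    replace: '# \1'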

kubelet config file

I just noticed, while running kubeadm init, that --pod-manifest-path has been deprecated in favor of the kubelet's --config flag.

This feels like it might be an upstream thing where the kubelet package should be updated (maybe it's already under way). I want to track this somewhere since it's a deprecation warning. Feel free to close if this is not the right place for that.

The full message I saw:

Apr 30 18:22:32 ip-172-31-38-145 kubelet[3157]: Flag --pod-manifest-path has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/
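
For reference, the replacement is a KubeletConfiguration file passed via --config, in which the static pod path becomes a field. A minimal sketch (the file path shown is the common kubeadm default):

# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests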

Clean up AWS usage

Right now the images are created and live in private Heptio dev accounts. The AMIs are made public so it's not really a big deal, but it is hard to track or clean up if we ever need to. Images are also subject to disappearing if a user's account is deleted and all of its resources are cleaned up.

cc @asenchi @stevesloka

Install docker-py only when required

We currently install docker-py unconditionally as part of the docker installation.

docker-py is required by the Ansible docker_image module. This module is used in roles/kubernetes/tasks/main.yml to download and cache Docker images on the nodes.

The caching of images is only performed when kubernetes_enable_cached_images == true. It seems like we could make the installation of docker-py conditional on that same variable.
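
A sketch of the conditional install; the pip module here is an assumption about how docker-py is currently installed:

- name: install docker-py for image caching
  pip:
    name: docker-py
  when: kubernetes_enable_cached_images | default(false) | bool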

IP routing being configured in multiple places

Right now IP routing is being configured in places it shouldn't be:

  • ansible/roles/kubernetes-cni/tasks/flannel.yml
  • ansible/roles/common/tasks/redhat.yml

As this is always required (for both image builds and swizzle), these sysctl settings should move to a common place (maybe the kubernetes role?).
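
For example, a single task in a shared role could own both settings (a sketch; the exact key list should be reconciled against the two files above):

- name: configure IP routing for kubernetes networking
  sysctl:
    name: "{{ item }}"
    value: "1"
    state: present
    sysctl_set: yes
  with_items:
    - net.ipv4.ip_forward
    - net.bridge.bridge-nf-call-iptables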

Add kubelet cloud provider arg

Wardroom should be putting the appropriate kubelet --cloud-provider flag in the systemd conf file.

We could reasonably edit the systemd drop-in directly after installing the package, rather than trying to use KUBELET_EXTRA_ARGS.
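
A sketch of the direct edit; the drop-in path, the environment-variable name, and the cloud_provider variable are all assumptions, and a daemon-reload plus kubelet restart would still be needed afterwards:

# declare a dedicated Environment line for the cloud provider flag...
- name: declare kubelet cloud provider args
  lineinfile:
    path: /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
    regexp: '^Environment="KUBELET_CLOUD_ARGS='
    line: 'Environment="KUBELET_CLOUD_ARGS=--cloud-provider={{ cloud_provider }}"'
    insertbefore: '^ExecStart='

# ...and make sure ExecStart actually expands it (the lookahead keeps this idempotent)
- name: reference the cloud provider args in ExecStart
  replace:
    path: /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
    regexp: '^(ExecStart=.*kubelet(?!.*KUBELET_CLOUD_ARGS).*)$'
    replace: '\1 $KUBELET_CLOUD_ARGS'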

[swizzle] Kubernetes packages are being installed on etcd nodes

Currently, the swizzle playbook is installing the kubernetes packages (and prereqs) to the etcd nodes. This is fine when deploying stacked masters, but seems unnecessary when deploying etcd as an external cluster on dedicated nodes.

In the external cluster scenario, there is no need for docker, kubelet, kubectl, etc on the etcd machines, given that etcd is running as a systemd service.

Set hostname to FQDN

For AWS cloud provider integration, it's necessary for the hostname of the node to match AWS' private DNS entry. Is it worth us building a task/role into Wardroom to set the hostname to the FQDN so that it matches the AWS private DNS entry?
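
If we do, a sketch might look like this, leaning on the EC2 metadata facts (module and fact names per stock Ansible; guarding this so it only runs on AWS would be needed):

- name: gather EC2 instance metadata
  ec2_metadata_facts:

- name: set the hostname to the AWS private DNS entry
  hostname:
    name: "{{ ansible_ec2_hostname }}"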

Enable TLS for etcd

Right now etcd is running without TLS. We should explore options for having wardroom generate and apply certs.

etcd role listed as dependency for k8s-master

In /roles/kubernetes-master/meta/main.yml, etcd is listed as a dependency of the master role. swizzle.yml already installs etcd on the etcd hosts via the etcd role, but this dependency also installs etcd on the masters, which is not desirable when etcd and the masters are separate machines (the Vagrantfile appears to put them on the same hosts).

@craigtracey

Dog-fooding

Ideally we use this to help create the quick-start.

I don't know if this will be ready in time for the next iteration.

/cc @chuckha

Firewall rules

Create the firewall punch-throughs up front to avoid the logistics of running around opening ports later.

Wardroom is not compatible with Ubuntu 18.04

This is a longer-term issue and not an immediate need, so feel free to defer until a later time.

I did some testing today with Ubuntu 18.04 to see if Wardroom could be used to build Kubernetes-ready images. I found problems with a number of areas:

  • Installing Docker fails (Docker CE 17.03.x cannot be installed on Ubuntu 18.04; none of the repositories have it)
  • The Kubernetes role fails when adding the apt repo (it looks for "bionic" when it should still use "xenial")

There may be other problems; this was as far as my limited testing allowed me to discover. I know that this isn't urgent/needed right now, but wanted to capture this information for use later when this is something we want to tackle.

[swizzle] Support environments that are behind a proxy

Many large organizations require all internet-bound traffic to traverse an HTTP proxy.

We should support the ability to specify proxy configuration and flow that config to the swizzle tasks that reach out to the internet.
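
A sketch of one approach: optional proxy variables threaded through the environment keyword on internet-bound tasks (the variable names and the get_url task shown are illustrative):

- name: download the etcd release tarball
  get_url:
    url: "{{ etcd_download_url }}"
    dest: /tmp/etcd.tar.gz
  environment:
    http_proxy: "{{ http_proxy | default('') }}"
    https_proxy: "{{ https_proxy | default('') }}"
    no_proxy: "{{ no_proxy | default('') }}"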

Output Artifact -> meta container

Today we produce images for VM infrastructure, but for on-prem environments there is a set of use cases that could benefit from a meta-container that bundles all of the repo's dependencies into a single local container.

@luxas created a container like this in the past, but the purpose of using wardroom is to allow others to generate the artifact(s).

/cc @detiber

Playbooks specify Docker CE 17.03.2 but actually install 18.03.1

The "docker" role in the Ansible playbooks specifies to install Docker CE 17.03.2, but when they are run against a CentOS 7.4 host (and this presumably will affect RHEL also, although I haven't tested yet) they actually install Docker CE 18.03.1. This version isn't yet validated for use with Kubernetes (although it may work just fine).

This output was taken from a CentOS 7.4 host after running the playbooks:

[centos@ip-10-16-13-114 ~]$ docker --version
Docker version 18.03.1-ce, build 9ee9f40
[centos@ip-10-16-13-114 ~]$ 

I ran into this problem recently with a customer. When I have a moment, I'll see about creating a PR to port the changes I implemented to fix this issue.
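
The likely fix is to pin the yum package to the full validated version string rather than a bare package name. A sketch (the exact RPM version string should be verified against the Docker CE repo):

- name: install pinned docker version
  yum:
    name: "docker-ce-{{ docker_rpm_version }}"
    state: present
  vars:
    docker_rpm_version: "17.03.2.ce-1.el7.centos"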

NTP knobs

We need to set up sane defaults for NTP and provide a configurable override for large-scale vendors.
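
A sketch of what the knobs could look like as role defaults plus a templated config (the variable name, pool hosts, and template are all assumptions):

# defaults/main.yml: sane public pool servers, overridable per deployment
ntp_servers:
  - 0.pool.ntp.org
  - 1.pool.ntp.org

# tasks/main.yml: render the chosen servers into place
- name: configure ntp
  template:
    src: ntp.conf.j2
    dest: /etc/ntp.conf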

create a python module for the scripts

As we script this with Python, we will have dependencies that need to be managed. Likewise, we may want to take advantage of some of the features of being an actual module (i.e., entry points). So, make a Python module of this code.

[dev branch] kubeadm template depends on kubeadm token which gets created after template is rendered

The kubeadm.conf template contains the kubeadm token, but it looks like the token is no longer generated before the template is rendered.

Changeset of interest: fc7cce9

TASK [kubernetes-master : drop kubeadm template] ************************************************************************************************************************************************************************************************************
fatal: [master-0 -> 54.202.136.38]: FAILED! => {"changed": false, "msg": "AnsibleUndefinedVariable: 'generated_token' is undefined"}

Intermittent kubeadm failures due to /proc/sys/net/bridge/bridge-nf-call-iptables

Using Wardroom to generate a CentOS 7.4-based AMI results in intermittent failures running kubeadm on instances launched from said AMI. The kubeadm error is:

	[ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables contents are not set to 1

This is using kubeadm version 1.11.1 on CentOS 7.4.
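
A likely root cause is that the br_netfilter module is not loaded at boot on CentOS, so /proc/sys/net/bridge/* either doesn't exist or reverts; whether it happens to be loaded by kubeadm time depends on what else has touched the bridge stack, which would explain the intermittence. A sketch of baking the fix into the image:

# ensure the bridge netfilter module is loaded now and at every boot
- name: load br_netfilter at boot
  copy:
    dest: /etc/modules-load.d/br_netfilter.conf
    content: "br_netfilter\n"

- name: load br_netfilter now
  modprobe:
    name: br_netfilter
    state: present

- name: persist bridge-nf-call-iptables
  sysctl:
    name: net.bridge.bridge-nf-call-iptables
    value: "1"
    state: present
    sysctl_set: yes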

Use of "primary_master" group breaks use of playbooks apart from provision.py

The use of the "primary_master" inventory group (introduced via #29) breaks the use of the Ansible playbooks apart from running them with the provision.py script (which creates and populates the "primary_master" group).

A potential workaround is to use a nested group in the Ansible inventory file, like this:

[primary_master:children]
<name of group that contains master nodes>

This would allow use of the randomisation functionality in provision.py as well as preserve functionality when using the playbooks apart from the script, but it does require specific configuration in the inventory in order to work.
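
Concretely, assuming the master nodes live in a group named masters (the group name is illustrative), the inventory would contain:

[masters]
master-0
master-1

[primary_master:children]
masters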

What do you think about adding `jq` to the base image?

I have found this is a fairly common utility and it would simplify the aws-quickstart to have jq already present on the image.

Is this reasonable for wardroom to include, or should consumers modify the image as needed?

Add support for image building for Oracle Cloud

There is a need for enabling Wardroom to build base images on Oracle Cloud. It looks as if HashiCorp Packer (as of 1.2.0 or later) includes a builder to build base OS images for Oracle Cloud, so this should be a fairly straightforward addition.

Won't add nodes if masters have already initialized

The "kubernetes-master" role uses kubeadm token generate to output a new token but not actually create the token on the server. This generated token is captured in a variable which is then used in a templated configuration file supplied to kubeadm init on the master(s). If the masters have already been initialized, then the token is never actually created on the server, and therefore can't be used to join nodes to the cluster. The end result is that the playbooks cannot be used to configure nodes once the masters have been initialized.

Node tests fail to run

https://github.com/heptiolabs/wardroom/tree/master/packer#testing-images mentions testing nodes directly via e2e_node.test, but upstream changes appear to have broken this functionality at this time.

I see the kubelet failing to respond to readiness checks even though the cluster is otherwise healthy and passes conformance tests (via Sonobuoy).

Errors in the logs look like:

Failure [130.525 seconds]
[BeforeSuite] BeforeSuite
_output/dockerized/go/src/k8s.io/kubernetes/test/e2e_node/e2e_node_suite_test.go:142

  Node 1 disappeared before completing BeforeSuite

  _output/dockerized/go/src/k8s.io/kubernetes/test/e2e_node/e2e_node_suite_test.go:142
------------------------------

And the kubelet itself has lots of warnings:

06:12:48 ip-10-0-21-210 kubelet[2270]: W0126 06:12:48.570608    2270 container.go:409] Failed to create summary reader for "/libcontainer_9469_systemd_test_default.slice": none of the resources are being tracked.
Jan 26 06:12:58 ip-10-0-21-210 kubelet[2270]: W0126 06:12:58.309639    2270 setters.go:144] replacing cloudprovider-reported hostname of ip-10-0-21-210.ec2.internal with overridden hostname of ip-10-0-21-210.ec2.internal
Jan 26 06:12:58 ip-10-0-21-210 kubelet[2270]: W0126 06:12:58.598792    2270 container.go:523] Failed to update stats for container "/libcontainer_9542_systemd_test_default.slice": failed to parse memory.failcnt - open /sys/fs/cgroup/memory/libcontainer_9542_systemd_test_default.slice/memory.failcnt: no such file or directory, continuing to push stats
Jan 26 06:12:58 ip-10-0-21-210 kubelet[2270]: W0126 06:12:58.599096    2270 raw.go:87] Error while processing event ("/sys/fs/cgroup/memory/libcontainer_9542_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/libcontainer_9542_systemd_test_default.slice: no such file or directory
Jan 26 06:12:58 ip-10-0-21-210 kubelet[2270]: W0126 06:12:58.599243    2270 raw.go:87] Error while processing event ("/sys/fs/cgroup/devices/libcontainer_9542_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/libcontainer_9542_systemd_test_default.slice: no such file or directory

Add tests

This repo needs tests. Minimally, we should:

  • attempt to build images upon PR
  • use molecule to run unit tests (a minimal scenario sketch follows this list)
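
A minimal molecule scenario for one of the shared roles might start like this (the driver, images, and verifier choices are assumptions):

# molecule/default/molecule.yml
driver:
  name: docker
platforms:
  - name: xenial
    image: ubuntu:16.04
  - name: centos7
    image: centos:7
provisioner:
  name: ansible
verifier:
  name: testinfra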
