worker-operator

A Kubernetes operator for running a cluster of Travis CI worker processes.

Why?

None of the built-in strategies in Kubernetes Deployments is sufficient on its own to roll out a new version of worker correctly.

worker needs to be careful not to run too many concurrent jobs, since those jobs depend on external computing resources that are finite. Gracefully shutting down an instance of worker requires waiting for its currently running jobs to finish, which can take hours rather than seconds. While we can handle a forceful termination by restarting jobs, we don't want to make that a regular part of our rollouts.

During a graceful shutdown, it's important that the new worker processes brought up to replace the old ones do not try to run too many new jobs. Normally, the pool size of a worker is fixed and configured in the environment of the pod. Under that fixed-size model, we would need to manage the number of new worker instances that are brought up so that the rollout is gradual and keeps the overall number of concurrent jobs in an acceptable range.

Kubernetes doesn't actually support this, though. It will only wait for a pod to be marked for deletion (and sent its stop signal), but not for the pod to actually finish its work. This is probably fine when your pods don't use a finite external resource and can normally shut down in a few seconds, but in our case, we really need to wait for this shutdown to happen.

What does `worker-operator` do?

It provides a new WorkerCluster custom resource. Each WorkerCluster creates an ordinary Deployment, so that we can reuse as much normal Kubernetes behavior as possible.

A WorkerCluster does not specify a number of replicas or how many jobs each replica should run, as we would have to when running worker in a plain Deployment. Instead, when you create a WorkerCluster, you specify the maximum number of jobs that should run concurrently. The operator does the math to determine how many replicas should be running based on that. If you reconfigure the cluster to run a different number of jobs, it adjusts the deployment as needed.
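As a rough illustration of that math, the sketch below assumes the operator divides `maxJobs` by `maxJobsPerWorker` (the fields used in the Examples section), rounds up to get the replica count, and fills each pool to the per-worker limit before moving on. The function name `plan_pool_sizes` is hypothetical, and the real operator may distribute jobs differently.

```python
import math


def plan_pool_sizes(max_jobs: int, max_jobs_per_worker: int) -> list[int]:
    """Return one pool size per replica so that the sizes sum to max_jobs.

    Hypothetical helper: assumes pools are filled greedily up to
    max_jobs_per_worker, which may not match the operator's real logic.
    """
    if max_jobs < 0 or max_jobs_per_worker <= 0:
        raise ValueError("max_jobs must be >= 0 and max_jobs_per_worker > 0")

    # Smallest number of replicas that can hold max_jobs in total.
    replicas = math.ceil(max_jobs / max_jobs_per_worker)

    sizes = []
    remaining = max_jobs
    for _ in range(replicas):
        size = min(max_jobs_per_worker, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


print(plan_pool_sizes(8, 3))  # [3, 3, 2]
```

With `maxJobs: 8` and `maxJobsPerWorker: 3`, this yields three replicas with pool sizes 3, 3 and 2.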

But that's just a little math. It doesn't really justify a whole operator. What's interesting is what happens when we roll out a new version or configuration of worker using a WorkerCluster.

The main goal of worker-operator is to dynamically adjust pool sizes of workers. When rolling out new worker pods, the operator will take an inventory of all the pods running in the WorkerCluster, noting their current pool size and whether they are running normally or in the process of shutting down. The operator will scale up the pools in the new processes as the old processes naturally drain their pools.

This means we can use the normal behavior of Kubernetes to roll out new pods immediately and just scale them up incrementally rather than all at once. All of this happens without any manual intervention.
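The rebalancing described above could look something like the following sketch. It rests on several assumptions: jobs still held by pods that are shutting down count against the cluster-wide limit, the leftover budget is handed to running pods up to the per-worker limit, and a draining pod is marked with a `Terminating` phase. The `Worker` class and `rebalance` function are hypothetical names, not the operator's actual code.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    current_pool_size: int
    phase: str  # "Running" for normal pods; "Terminating" is an assumed name


def rebalance(workers: list[Worker], max_jobs: int, max_jobs_per_worker: int) -> dict[str, int]:
    """Return the pool size to request from each running worker."""
    # Jobs still held by draining workers count against the cluster-wide limit.
    draining = sum(w.current_pool_size for w in workers if w.phase != "Running")
    budget = max(0, max_jobs - draining)

    requested = {}
    for w in workers:
        if w.phase != "Running":
            continue
        size = min(max_jobs_per_worker, budget)
        requested[w.name] = size
        budget -= size
    return requested


# Mid-rollout: two old pods still hold 3 and 2 jobs, three new pods just started.
pods = [
    Worker("old-a", 3, "Terminating"),
    Worker("old-b", 2, "Terminating"),
    Worker("new-a", 0, "Running"),
    Worker("new-b", 0, "Running"),
    Worker("new-c", 0, "Running"),
]
print(rebalance(pods, max_jobs=8, max_jobs_per_worker=3))
# {'new-a': 3, 'new-b': 0, 'new-c': 0}
```

Rerunning the same calculation once the old pods have drained to zero gives the new pods pool sizes of 3, 3 and 2, so the total never exceeds the configured limit of 8 at any point in the rollout.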

Examples

The following YAML creates a deployment of three workers whose pool sizes total 8.

apiVersion: travisci.com/v1alpha1
kind: WorkerCluster
metadata:
  name: example-worker
spec:
  maxJobs: 8
  maxJobsPerWorker: 3

  selector:
    matchLabels:
      name: example-worker
  template:
    metadata:
      labels:
        name: example-worker
    spec:
      image: travisci/worker:v6.0.0
      imagePullPolicy: IfNotPresent
      env:
      - name: TRAVIS_WORKER_JUPITERBRAIN_SSH_KEY_PATH
        value: /etc/worker/ssh/travis-vm.key
      envFrom:
      - configMapRef:
          name: worker-com
      - secretRef:
          name: worker-com
      sshKeySecret: worker-com-vm-key

You can see what the pool sizes are for the individual workers by inspecting the status subresource of the worker cluster:

$ kubectl get -o yaml workercluster example-worker
...
status:
  workerStatuses:
  - currentPoolSize: 3
    expectedPoolSize: 3
    name: example-worker-668cf57bff-254w7
    phase: Running
    requestedPoolSize: 3
  - currentPoolSize: 3
    expectedPoolSize: 3
    name: example-worker-668cf57bff-b6nj4
    phase: Running
    requestedPoolSize: 3
  - currentPoolSize: 2
    expectedPoolSize: 2
    name: example-worker-668cf57bff-bmbwn
    phase: Running
    requestedPoolSize: 2
...
