worker-operator

A Kubernetes operator for running a cluster of Travis CI worker processes.

Why?

None of the built-in strategies in Kubernetes Deployments are sufficient on their own to roll out a new version of worker correctly.

worker needs to be careful to not run too many concurrent jobs, since those jobs depend on external computing resources that are not infinite. Gracefully shutting down an instance of worker requires waiting for the currently running jobs to finish, which can take on the order of hours rather than seconds. While we can handle a forceful termination by restarting jobs, we don't want to make it a regular part of our rollouts.

During a graceful shutdown, it's important that the new worker processes brought up to replace the old ones do not try to run too many new jobs. Normally, the pool size of a worker is fixed and configured in the environment of the pod. In this model, we need to manage the number of new worker instances that are brought up so that the rollout is gradual and keeps the overall number of concurrent jobs in an acceptable range.

Kubernetes doesn't actually support this, though. It will only wait for a pod to be marked for deletion (and sent its stop signal), but not for the pod to actually finish its work. This is probably fine when your pods don't use a finite external resource and can normally shut down in a few seconds, but in our case, we really need to wait for this shutdown to happen.

What does worker-operator do?

It provides a new WorkerCluster custom resource. Each WorkerCluster creates an ordinary Deployment, so that we can reuse as much normal Kubernetes behavior as possible.

A WorkerCluster does not specify a number of replicas or how many jobs each replica should run, like we would have to when running worker just in a Deployment. Instead, when you create a WorkerCluster, you specify the maximum number of jobs that should run concurrently. The operator does the math to determine how many replicas should be running based on that. If you reconfigure the cluster to run a different number of jobs, it adjusts the deployment appropriately if needed.
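As a rough illustration of that math, here is a minimal Python sketch. It assumes the two inputs are the maxJobs and maxJobsPerWorker fields used in the example further down, and that workers are filled up to the per-worker cap one after another. The function name plan_pool_sizes is hypothetical, and the operator's real distribution logic may differ; this sketch only reproduces the 3/3/2 split shown in the example status.

```python
def plan_pool_sizes(max_jobs: int, max_jobs_per_worker: int) -> list[int]:
    """Return one pool size per replica so that the sizes sum to max_jobs.

    Assumption: each worker is filled up to max_jobs_per_worker before the
    next one receives any jobs. This is an illustrative guess, not the
    operator's documented algorithm.
    """
    if max_jobs < 0 or max_jobs_per_worker <= 0:
        raise ValueError("max_jobs must be >= 0 and max_jobs_per_worker > 0")

    # Fewest replicas that can hold max_jobs (integer ceiling division).
    replicas = (max_jobs + max_jobs_per_worker - 1) // max_jobs_per_worker

    sizes = []
    remaining = max_jobs
    for _ in range(replicas):
        size = min(max_jobs_per_worker, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


if __name__ == "__main__":
    print(plan_pool_sizes(8, 3))
```

With maxJobs set to 8 and maxJobsPerWorker set to 3, the sketch yields three replicas, which matches the number of workers in the example below.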

But that's just a little math. It doesn't really justify a whole operator. What's interesting is what happens when we roll out a new version or configuration of worker using a WorkerCluster.

The main goal of worker-operator is to dynamically adjust pool sizes of workers. When rolling out new worker pods, the operator will take an inventory of all the pods running in the WorkerCluster, noting their current pool size and whether they are running normally or in the process of shutting down. The operator will scale up the pools in the new processes as the old processes naturally drain their pools.

This means we can use the normal behavior of Kubernetes to roll out new pods immediately and just scale them up incrementally rather than all at once. All of this happens without any manual intervention.
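The inventory-and-scale step described above can be sketched as follows. This is a simplified model under stated assumptions, not the operator's actual code: the WorkerPod type, the draining flag, and the requested_pool_sizes function are hypothetical names, and the rule "give new workers whatever the draining workers are no longer using" is one plausible reading of the description.

```python
from dataclasses import dataclass


@dataclass
class WorkerPod:
    name: str
    current_pool_size: int  # jobs this worker can currently run
    draining: bool          # True once the pod is shutting down


def requested_pool_sizes(
    pods: list[WorkerPod], max_jobs: int, max_jobs_per_worker: int
) -> dict[str, int]:
    """Decide a pool size for every non-draining pod.

    Assumption: slots still held by draining pods count against max_jobs,
    and the remainder is handed to running pods in list order, capped at
    max_jobs_per_worker each.
    """
    draining_slots = sum(p.current_pool_size for p in pods if p.draining)
    budget = max(0, max_jobs - draining_slots)

    requested = {}
    for pod in pods:
        if pod.draining:
            continue  # old workers only shrink on their own
        size = min(max_jobs_per_worker, budget)
        requested[pod.name] = size
        budget -= size
    return requested


if __name__ == "__main__":
    pods = [
        WorkerPod("old-a", 1, draining=True),   # has finished 2 of 3 jobs
        WorkerPod("old-b", 3, draining=True),
        WorkerPod("old-c", 2, draining=True),
        WorkerPod("new-a", 0, draining=False),
        WorkerPod("new-b", 0, draining=False),
        WorkerPod("new-c", 0, draining=False),
    ]
    print(requested_pool_sizes(pods, max_jobs=8, max_jobs_per_worker=3))
```

In this model the total across old and new pods never exceeds the configured maximum: new pods start at zero while the old ones are full, and they grow only as the old pools shrink.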

Examples

The following YAML creates a Deployment of three workers whose pool sizes total 8.

apiVersion: travisci.com/v1alpha1
kind: WorkerCluster
metadata:
  name: example-worker
spec:
  maxJobs: 8
  maxJobsPerWorker: 3

  selector:
    matchLabels:
      name: example-worker
  template:
    metadata:
      labels:
        name: example-worker
    spec:
      image: travisci/worker:v6.0.0
      imagePullPolicy: IfNotPresent
      env:
      - name: TRAVIS_WORKER_JUPITERBRAIN_SSH_KEY_PATH
        value: /etc/worker/ssh/travis-vm.key
      envFrom:
      - configMapRef:
          name: worker-com
      - secretRef:
          name: worker-com
      sshKeySecret: worker-com-vm-key

You can see what the pool sizes are for the individual workers by inspecting the status subresource of the worker cluster:

$ kubectl get -o yaml workercluster example-worker
...
status:
  workerStatuses:
  - currentPoolSize: 3
    expectedPoolSize: 3
    name: example-worker-668cf57bff-254w7
    phase: Running
    requestedPoolSize: 3
  - currentPoolSize: 3
    expectedPoolSize: 3
    name: example-worker-668cf57bff-b6nj4
    phase: Running
    requestedPoolSize: 3
  - currentPoolSize: 2
    expectedPoolSize: 2
    name: example-worker-668cf57bff-bmbwn
    phase: Running
    requestedPoolSize: 2
...
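To check the overall capacity rather than reading each entry by hand, the same status can be summed with a few lines of Python. This is only a sketch: it assumes the resource was fetched with kubectl's -o json output instead of -o yaml, and the helper name total_pool_size is hypothetical. The sample data is copied from the status shown above.

```python
import json


def total_pool_size(workercluster: dict, field: str = "currentPoolSize") -> int:
    """Sum one pool-size field across all entries in status.workerStatuses."""
    statuses = workercluster.get("status", {}).get("workerStatuses", [])
    return sum(s.get(field, 0) for s in statuses)


# Trimmed stand-in for `kubectl get -o json workercluster example-worker`.
sample = json.loads("""
{
  "status": {
    "workerStatuses": [
      {"name": "example-worker-668cf57bff-254w7", "phase": "Running",
       "currentPoolSize": 3, "expectedPoolSize": 3, "requestedPoolSize": 3},
      {"name": "example-worker-668cf57bff-b6nj4", "phase": "Running",
       "currentPoolSize": 3, "expectedPoolSize": 3, "requestedPoolSize": 3},
      {"name": "example-worker-668cf57bff-bmbwn", "phase": "Running",
       "currentPoolSize": 2, "expectedPoolSize": 2, "requestedPoolSize": 2}
    ]
  }
}
""")

if __name__ == "__main__":
    print(total_pool_size(sample))
```

For the status above, the total equals the maxJobs value of 8 from the spec.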
