
kupid's People

Contributors

andreasburger, ashwani2k, ccwienk, dependabot[bot], dkistner, gardener-robot-ci-1, gardener-robot-ci-2, gardener-robot-ci-3, petersutter, raphaelvogel, renormalize, rfranzke, shreyas-s-rao, timebertt, timuthy


kupid's Issues

Drop Kupid in favor of an alternative (OPA Gatekeeper or Kyverno...)

What would you like to be added:

Readme says:

The OPA Gatekeeper allows to define policy to validate and mutate any kubernetes resource. Technically, this can be used to dynamically inject anything, including scheduling policy into pods. But this is too big a component to introduce just to dynamically inject scheduling policy. Besides, the policy definition as code is undesirable in this context because the policy itself would be non-declarative and hard to validate while deploying the policy.

However, this doesn't seem to justify building our own component (which is currently unmaintained?) compared to the relatively low effort of reusing a well-established project from the community.

This repository could basically be a few yaml files instead of thousands of lines of code.
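For illustration, a rough sketch of what such a replacement could look like as a Kyverno mutate policy; the policy name, label selector, and injected toleration below are assumptions, not an agreed design:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: inject-etcd-scheduling-constraints   # hypothetical name
spec:
  rules:
  - name: add-etcd-toleration
    match:
      resources:
        kinds:
        - StatefulSet
        selector:
          matchLabels:
            app: etcd-statefulset             # hypothetical label
    mutate:
      patchStrategicMerge:
        spec:
          template:
            spec:
              tolerations:
              - key: dedicated                # hypothetical taint key
                operator: Equal
                value: etcd
                effect: NoSchedule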

Why is this needed:

  • relieve us from unnecessary maintenance effort (see open PRs; the repository requires regular dependency updates, ref #32, and has open dependabot vulnerability alerts)
  • OPA Gatekeeper would open the door to many other mechanisms (e.g. mutating specific shoot control planes)

Pending tasks for gardener integration

What would you like to be added:

  • A general solution for reserving excess capacity (both on common and dedicated nodes). A tentative solution is here; a generic sketch of the pattern follows at the end of this issue.
  • A general solution in the landscape deployment to create dedicated worker pools for etcd and to reserve excess capacity.

Why is this needed:
To complete the integration with gardener.
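For reference, a common pattern for reserving excess capacity is a deployment of pause pods with a low (negative) priority so they are preempted whenever real workloads need the room. This is only a generic sketch, not the tentative solution linked above; all names, replica counts, and resource requests are assumptions:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: excess-capacity-reservation   # hypothetical name
value: -5
globalDefault: false
description: Low priority class for excess capacity reservation pods
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: excess-capacity-reservation   # hypothetical name
spec:
  replicas: 2                         # sized per worker pool as needed
  selector:
    matchLabels:
      app: excess-capacity-reservation
  template:
    metadata:
      labels:
        app: excess-capacity-reservation
    spec:
      priorityClassName: excess-capacity-reservation
      terminationGracePeriodSeconds: 5
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"                  # the amount of capacity to reserve
            memory: 1Gi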

Mutating webhook should handle only relevant requests

What would you like to be added:

I would like Kupid's mutating webhook to handle only the requests that are relevant to it, by using an objectSelector in the webhook configuration. The object selector can be set based on the PodSchedulingPolicies (PSPs) and ClusterPodSchedulingPolicies (CPSPs) that Kupid uses to mutate these resources.

Why is this needed:

Today Kupid receives every request in the cluster, while it only needs to mutate specific resources (like the etcd StatefulSet) based on resource labels. Filtering with an objectSelector would lower Kupid's resource consumption (by sparing it irrelevant requests) and reduce log volume by getting rid of unnecessary Handling request... logs, as sketched below.
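A minimal sketch of what the objectSelector could look like in kupid's MutatingWebhookConfiguration; the webhook name, service details, and the opt-in label are assumptions, and in practice the selector would have to be derived from the label selectors of the PodSchedulingPolicies and ClusterPodSchedulingPolicies in use:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: kupid                                   # hypothetical name
webhooks:
- name: mutations.kupid.gardener.cloud          # hypothetical name
  objectSelector:
    matchLabels:
      kupid.gardener.cloud/inject: "true"       # hypothetical opt-in label
  rules:
  - apiGroups: ["apps"]
    apiVersions: ["v1"]
    resources: ["statefulsets"]
    operations: ["CREATE", "UPDATE"]
  clientConfig:
    service:
      name: kupid                               # hypothetical service
      namespace: kupid
  sideEffects: None
  admissionReviewVersions: ["v1"]
  failurePolicy: Ignore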

Improve error logging

What would you like to be added:
Log errors at a level that is convenient to configure without flooding the logs.

Why is this needed:
To help debug issues without flooding the logs.

Add healthz endpoint and metrics

What would you like to be added:
Add support for healthz endpoint and expose metrics.

Why is this needed:
To enable a livenessProbe and to collect relevant metrics.
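For illustration, once a healthz endpoint exists it could be wired into the kupid container roughly as follows; the ports and paths are assumptions:

# Fragment of the kupid container spec in the deployment (ports/paths assumed):
livenessProbe:
  httpGet:
    path: /healthz
    port: 8081
  initialDelaySeconds: 10
  periodSeconds: 10
ports:
- name: metrics
  containerPort: 8080    # assumed metrics port, to be scraped by Prometheus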

Kupid should not generate the MutatingWebhookConfiguration with a rule that mutates Jobs on update

What happened:
Define a Job whose pod template carries certain labels.
Define a cluster- or namespace-scoped PodSchedulingPolicy in Kupid whose label selector matches the pods created by the above Job definition.

Once the Job has completed, change the PodSchedulingPolicy by changing its label selector.
Try to delete the Job that was created before the PodSchedulingPolicy was changed.
Once the new policy comes into effect, it prevents the existing Job from being deleted, because the webhook tries to update the Job's spec.template (an immutable field) according to the new policy.
We see the following errors in the KCM logs:

I1207 11:25:10.408731       1 garbagecollector.go:529] remove DeleteDependents finalizer for item [batch/v1/Job, namespace: shoot--dev--test-ash-3, name: a3fc17-compact-job, uid: 728ec193-2a8e-4f8a-befb-1fa9526ef7f8]
E1207 11:25:10.431940       1 garbagecollector.go:309] error syncing item &garbagecollector.node{identity:garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"batch/v1", Kind:"Job", Name:"a3fc17-compact-job", UID:"728ec193-2a8e-4f8a-befb-1fa9526ef7f8", BlockOwnerDeletion:(*bool)(0xc001d277fa)}}}: Job.batch "a3fc17-compact-job" is invalid: spec.template: Invalid value: core.PodTemplateSpec: field is immutable

This leaves the Job orphaned unless it is deleted manually.

What you expected to happen:
Kupid should not update the spec of an existing Job, since a Job runs to completion and its pod template is immutable.

How to reproduce it (as minimally and precisely as possible):
Follow the steps above.

Anything else we need to know:
This happens only when the earlier Job is not deleted before the change to the PodSchedulingPolicy is made.

Environment:
k8s - v1.19.5
kupid - v0.1.6
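One possible mitigation, assuming kupid should keep mutating Jobs at creation time at all, is to restrict the generated webhook rule for Jobs to CREATE operations so that completed Jobs are never re-mutated on update. A sketch of the relevant rule only, not kupid's actual generated configuration:

webhooks:
- name: mutations.kupid.gardener.cloud   # hypothetical name
  rules:
  - apiGroups: ["batch"]
    apiVersions: ["v1"]
    resources: ["jobs"]
    operations: ["CREATE"]               # no UPDATE: a Job's pod template is immutable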

Missing priorityClass for kupid-extension

What happened:
The kupid extension deployment lacks a priority class. If the extension runs in a cluster with limited capacity, existing Shoots which require the kupid extension can't be reconciled, as other components (e.g. control plane components) might have a higher priority.

As for all other extensions, I recommend using a priority class with value 1000000000.
Ref: https://github.com/gardener/gardener-extension-provider-aws/blob/master/charts/gardener-extension-provider-aws/templates/priorityclass.yaml#L5
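Concretely, a priority class along these lines, where the name is an assumption and the value follows the recommendation above; the extension deployment would then reference it via spec.template.spec.priorityClassName:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gardener-extension-kupid   # hypothetical name
value: 1000000000
globalDefault: false
description: Priority class for the kupid extension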

/assign

Allow configuring QPS and burst via helm chart

What would you like to be added:
Users should be able to configure QPS and burst via the kupid helm chart.

Why is this needed:
Currently, kupid provides flags that allow users to set the QPS and burst for the manager's client configuration, but these are not exposed via the helm chart. The helm chart needs to be enhanced to allow setting these values.
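A sketch of how this could surface in the chart's values.yaml; the key names below are assumptions, not the chart's current schema:

# values.yaml (hypothetical keys)
clientConnection:
  qps: 100      # passed through to kupid's QPS flag
  burst: 150    # passed through to kupid's burst flag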

Handling failures in applying scheduling policies

What would you like to be added:
Currently, Kupid ignores failures when applying the scheduling policies.
Either we should fail the admission when a policy cannot be applied, so that there is no side effect of a pod being scheduled on workers not allowed by the scheduling policies, or
we should log such errors and raise alerts so that the operator becomes aware of such pod scheduling and has a chance to react to the anomaly (see the failurePolicy sketch at the end of this issue).

Why is this needed:
In a real scenario this has pros and cons:
Pro

  • It ensures that critical components like etcd get scheduled, even if not on the worker pool the policy describes but rather wherever the scheduler places them.

Cons

  • If etcd is scheduled onto an unintended worker, that worker's health can affect etcd's availability, and etcd might get rolled even though nothing is wrong with etcd itself. This would not have happened had the etcd pod been deployed on the intended worker as per the policy definitions.
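For webhook-level failures (e.g. the webhook not being reachable), this trade-off maps to the failurePolicy of the generated webhook configuration; failures inside kupid's own handler would need the equivalent choice in code. A sketch of the webhook-level knob, not kupid's actual generated config:

webhooks:
- name: mutations.kupid.gardener.cloud   # hypothetical name
  failurePolicy: Fail      # reject admission if the scheduling policy cannot be applied
  # failurePolicy: Ignore  # admission proceeds without the policy (the behaviour described above)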

Support for k8s 1.22+

What would you like to be added:
Support for Kubernetes 1.22+

Why is this needed:
With v1.22, Kubernetes dropped the beta versions of the ValidatingWebhookConfiguration and MutatingWebhookConfiguration APIs in admissionregistration.k8s.io/v1beta1, as these have moved to v1. The same applies to the CustomResourceDefinition API in apiextensions.k8s.io/v1beta1, which has also graduated to v1. More details can be found in the k8s 1.22 API changes page.

/assign
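For illustration, the webhook configuration change largely amounts to the apiVersion bump below plus the fields that became mandatory in v1 (the names are hypothetical); the CustomResourceDefinition move from apiextensions.k8s.io/v1beta1 to v1 additionally requires a structural schema:

apiVersion: admissionregistration.k8s.io/v1   # was admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: kupid                                 # hypothetical name
webhooks:
- name: mutations.kupid.gardener.cloud        # hypothetical name
  sideEffects: None                           # required in v1
  admissionReviewVersions: ["v1"]             # required in v1
  clientConfig:
    service:
      name: kupid                             # hypothetical service
      namespace: kupid
  rules:
  - apiGroups: ["apps"]
    apiVersions: ["v1"]
    resources: ["statefulsets"]
    operations: ["CREATE", "UPDATE"]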

Log the first mutation done to affinity rules by kupid

What happened:
When Kupid mutates the affinity rules of a resource that originally has no affinity rules, that mutation is not logged. Logging it is required for better diagnostics.

What you expected to happen:
All mutations (if any) that kupid makes to a resource's node affinity should be logged.

How to reproduce it (as minimally and precisely as possible):
Create a new StatefulSet and ensure that it is a target for kupid to inject affinity rules into. You will see that kupid injects the rules defined in the ClusterPodSchedulingPolicy resource, but it does not log this first mutation.

Support full strategic merge patch

What would you like to be added:
If there is a conflict between the scheduling criteria being injected (potentially from more than one policy) and what is already present in the target pod spec/template, the merge kupid currently performs is ad hoc.

It is desirable to support a full strategic merge patch in such cases.

Why is this needed:
Consistency with Kubernetes best practices.
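An illustrative conflict, with hypothetical pool names: the target pod template already pins a worker pool while a matching policy injects a different one. Today the outcome of merging the two is ad hoc; a full strategic merge patch would make it well-defined:

# Already present in the target pod spec/template:
spec:
  nodeSelector:
    worker.gardener.cloud/pool: general
# Injected by a matching (Cluster)PodSchedulingPolicy:
spec:
  nodeSelector:
    worker.gardener.cloud/pool: etcd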
