Comments (13)
Meeting minutes from an out-of-band discussion between myself, @vlerenc, @unmarshall, @ishan16696, @renormalize, @seshachalam-yv and @ashwani2k:
- Agreed to remove `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` for compaction job pods, as suggested by @vlerenc
- Agreed to remove `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` for multi-node etcd pods, as suggested by @vlerenc
- For single-node etcd pods, we discussed the following:
  - CA does not select a node for drain if the node runs at least one pod with the annotation `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` set on it. So there is no deterministic way for an agent to figure out whether a node is "to be" drained, in order to decide to kick the etcd pods off of it during their respective maintenance windows. The hope is that kicking the etcd pods off would allow them to be scheduled onto a node with higher utilisation, thanks to bin-packing.
  - Another option is to set a PDB with `minAvailable: 1` for single-node etcds and change this to `minAvailable: 0` during the maintenance window of each etcd. But CA considers PDB disruptions for the pods running on a node while selecting the node to drain. So even with this approach, we face the same issue of non-overlapping maintenance windows, and of having to select the node to drain and then kick the etcd pods off one by one during their respective maintenance windows. Additionally, the PDB would block MCM from draining nodes during node rollouts, effectively blocking them or forcing MCM to forcefully shut down machines.
  - Selecting a node for drain should be done in such a way that it doesn't interfere with CA's own usage-based criteria for selecting a node to drain. For instance, if CA is configured to drain a node that is only 20% utilised, then our custom agent needs to select the node when it is at 40% usage, taint it, and then start kicking etcd pods off voluntarily during their maintenance windows.
  - @unmarshall suggested co-locating etcd pods with the same/similar maintenance windows onto the same nodes, so that the maintenance windows for the etcds on a node to be drained would more or less fall into the same time period, allowing an agent to remove the safe-to-evict annotation and kick all the etcd pods off at the same time.
  - Another suggestion is to move entirely to multi-node etcd for `etcd-main`, even for non-HA control planes, and get rid of `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` altogether, possibly based on shoot purpose (testing, production, etc.). This way, we allow CA to drain nodes from etcd worker pools, without any single-node etcd pods blocking them.
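The PDB option discussed above could be sketched roughly as follows (an illustration only; the name, namespace and labels are placeholders, not the actual gardener manifests):

```yaml
# Sketch of the discussed option: a PDB guarding a single-node etcd.
# Outside the maintenance window, minAvailable: 1 blocks voluntary eviction;
# during the etcd's maintenance window it would be flipped to 0.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-main            # illustrative name
  namespace: shoot--foo--bar # illustrative namespace
spec:
  minAvailable: 1            # set to 0 during the maintenance window
  selector:
    matchLabels:
      instance: etcd-main    # illustrative label
```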
Action items:
- @vlerenc to check stats for switching from single-node to multi-node etcds for non-HA control planes, and to check whether the cost savings on nodes would justify the extra cost of running 2 extra replicas of etcd-main (memory, CPU, disk). The main cost differential seems to be with etcd disks, especially for AWS:
  - New AWS etcd pods will soon be created with `25Gi` volumes instead of the previous `80Gi` value, via gardener/gardener-extension-provider-aws#933
  - Reducing the volume size for old etcd pods will require #481 to be implemented, allowing druid to roll volumes
from etcd-druid.
Do not add the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation to the etcd-main pods for the clustered etcd-main statefulset
When evicting a member pod from a clustered etcd, there are a few points to keep in mind:
- All members of the etcd cluster should be functioning correctly. If one etcd member is already down in a 3-member etcd cluster, evicting another member could lead to quorum loss. While Pod Disruption Budgets (PDBs) will prevent etcd from losing quorum, I still believe it would be better to have a precondition that checks that all members of the etcd cluster are healthy.
- When evicting a member from the etcd cluster, it's preferable to choose a follower rather than the leader. This helps to avoid unnecessary leader elections.
Terminate the singleton etcd-main statefulset pod
To avoid a worst-case scenario, i.e. a long restoration, I would suggest letting backup-restore take the scheduled full snapshot (which is also performed during the maintenance window) before evicting a single-node etcd-main pod.
It will probably not amount to much/anything, but (1) could be done right away (compactor jobs), right?
@ishan16696 How would you suggest doing that? This is about cost savings, cluster-autoscaler, evictions, PDBs. You do not have a real hook into these things, so in order for the above to be implemented, you would need to take control of everything (put the PDB to 3 minAvailable / 0 maxUnavailable and let the druid do the job if it sees the "signs"), right?
But you already run ETCD with a PDB of 2 minAvailable / 1 maxUnavailable (for clustered ETCD), so you as a team already decided to go with a flexible PDB, have you not?
But you already run ETCD with a PDB of 2 minAvailable / 1 maxUnavailable (for clustered ETCD), so you as a team already decided to go with a flexible PDB, have you not?
@vlerenc Yes, but what does flexible mean here? The PDB allows at most 1 pod to be down/evicted. So the PDB should most certainly block a pod eviction if another pod is already down, shouldn't it?
I still believe it would be better to have a precondition that check all members of the etcd cluster are healthy.
@ishan16696 but where would this precondition be? Druid doesn't have any say in the eviction of an etcd pod or the draining of a node. Those events can occur at any time, and druid really can't block them. Well, maybe it's possible if it puts a finalizer on the pods themselves, and wouldn't remove the finalizer unless it allows the pod to be evicted. This could possibly be another part of the scope for druid-controlled pod-updates, maybe? @unmarshall
When evicting a member from the etcd cluster, it's preferable to choose a follower rather than the leader. This helps to avoid unnecessary leader elections.
@ishan16696 if the node being scaled down is the one hosting the leader, then there would definitely be a leader election, no doubt. But the optimisation here would be for druid to manually move leadership from this pod to another, and then unblock the pod for deletion.
Yes, but what does flexible mean here? The PDB allows at most 1 pod to be down/evicted. So the PDB should most certainly block a pod eviction if another pod is already down, shouldn't it?
@shreyas-s-rao What I mean is that you already allow for 1 voluntary eviction (for clustered ETCDs). So, you made this decision already and are OK with it/that it may happen (in general). All I suggested, because I saw that PDB, was to drop the CA annotation that only influences CA. You already OKed voluntary evictions in general in the PDB.
Or in other words: If there are general concerns with voluntary evictions (what PDBs do - which would be respected during scale-in of a node up until the timeout), why did we define a PDB allowing a voluntary eviction of 1?
Yes, I would guess so. Without this flexibility of allowing 1 replica to be evicted, it would be a nightmare to scale down any nodes at all, irrespective of the CA annotation.
Do not add the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation to the etcd-main-compactor jobs (just because it's in the etcd-main spec)
Do not add the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation to the etcd-main pods for the clustered etcd-main statefulset
For a multi-node etcd cluster there is no need for this annotation, since we already have a PDB defined which allows 1 eviction at most. Today, for a single-node etcd cluster, we set `minAvailable` to 0, so it is the annotation that prevents the single pod from being evicted and causing a downtime; here we need to be careful.
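The multi-node PDB referred to here could look roughly like this (a minimal sketch; the name and labels are illustrative, not the actual druid-generated manifest):

```yaml
# Sketch of a PDB for a 3-member etcd cluster: minAvailable: 2 keeps
# quorum and thereby allows at most 1 voluntary eviction at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-main # illustrative
spec:
  minAvailable: 2 # quorum of a 3-member cluster
  selector:
    matchLabels:
      instance: etcd-main # illustrative
```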
But the optimisation here would be for druid to manually move leadership from this pod to another, and then unblock the pod for deletion.
Do we really need this? Leader election can happen at any time during a lifecycle of an etcd cluster. So how is leader election caused due to CA scaling down a node any different?
I still believe it would be better to have a precondition that check all members of the etcd cluster are healthy.
So I checked the way the disruption controller computes available pods: it checks the ready condition of the Pod. See countHealthyPods, where it checks the ready condition and, based on the current healthy and desired healthy counts, updates the PDB status. Now, the ready condition of the Pod is determined by the ready conditions of all of its containers, which is set here in the kubelet. For each individual container a worker is created, whose run is invoked as a goroutine that periodically runs probes. The result of the readiness probe is then used to update the container status, which then, collectively across all containers, constitutes the pod readiness. Well, it seems I went down a rabbit hole just to make a point. The point is that the readiness probe defined on every etcd member today reflects cluster readiness, not pod readiness. So if your readiness probe says that it is not ready because there is no quorum, even if you have all 3 pods in a 3-member HA etcd cluster, then ideally the PDB should prevent even a single node from being evicted, right?
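The kind of probe being described could be sketched as follows (endpoint and port are placeholders, not the actual etcd chart values): a readiness probe that reports cluster-wide health, so a quorum loss flips the ready condition to false on all members at once:

```yaml
# Sketch: a readiness probe that reflects *cluster* readiness.
# If quorum is lost, this endpoint fails on every member, so the
# ready condition goes false on all pods simultaneously.
readinessProbe:
  httpGet:
    path: /healthz # placeholder endpoint
    port: 8080     # placeholder port
  periodSeconds: 5
  failureThreshold: 3
```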
In an out-of-band discussion between myself, @unmarshall, @ishan16696, @seshachalam-yv and @renormalize, we concluded the following:
- It makes sense to remove the annotation `cluster-autoscaler.kubernetes.io/safe-to-evict: false` for multi-node etcds. This change needs to be done on the gardenlet side when it creates the `Etcd` resource, since druid simply sees the annotation and sets it on the pod.
- Druid should not bother about terminating pods, since it has no knowledge about shoot purpose or shoot maintenance windows. Instead, gardenlet should be the one to take care of this, since it is the one with full knowledge about this. It is not semantically correct for druid to voluntarily delete perfectly well-running pods.
  i. Gardenlet can do this by assessing nodes that are stuck in drain due to a blocking pod that has the annotation `cluster-autoscaler.kubernetes.io/safe-to-evict: false` set on it, and removing such a pod. Or better yet, it should remove the annotation during the maintenance window and add it back at the end of the maintenance window, depending on shoot purpose. Druid's job is to simply reconcile the etcd and add/update/delete annotations based on what is set in the `Etcd` resource.
- Druid could still provide a trigger point for gardenlet to trigger deletion of a pod, like a PodDeletionTask, but this might be overkill for the simple task of just deleting a pod.
@vlerenc PTAL.
Gardenlet can do this by assessing nodes that are stuck in drain due to a blocking pod that has the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: false set on it, and remove such a pod.
Why would that happen? The annotation only influences the CA, nothing else, no?
Just a side note on this topic:
I still believe it would be better to have a precondition that check all members of the etcd cluster are healthy.
So i checked the way disruption controller computes available pods. It checks the ready condition of the Pod to determine.
This is true, until you set `unhealthyPodEvictionPolicy: AlwaysAllow` in your PDB. This was specifically introduced to help get out of situations where you do want eviction to happen although pods are currently unhealthy, for example when an OOM occurs.
We do that for many of the gardener managed components, but not for etcd, AFAIK.
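For reference, the field under discussion sits directly in the PDB spec (policy/v1); a sketch with an illustrative component name:

```yaml
# Sketch: a PDB that allows eviction of not-ready pods regardless
# of the budget. The default (field unset) is IfHealthyBudget.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: some-component # illustrative
spec:
  minAvailable: 1
  unhealthyPodEvictionPolicy: AlwaysAllow
  selector:
    matchLabels:
      app: some-component # illustrative
```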
This is true, until you set unhealthyPodEvictionPolicy: AlwaysAllow in your PDB.
Thanks for sharing this, as I was unaware of this feature. Unhealthy translates to the ready condition being false. For an etcd cluster, if there is a loss of quorum, then the ready condition for all pods will change to false. This then allows even the healthier pod (1 out of 3) to also be evicted. So in case of a permanent quorum loss this would not have much of a side effect, but I am wondering how this will impact transient quorum-loss situations. Today we do not set `unhealthyPodEvictionPolicy` in the PDB for etcd, and IMHO we should continue to not set this field at all, which then defaults to `IfHealthyBudget`, which is what we currently have as well. Just removing the CA annotation `cluster-autoscaler.kubernetes.io/safe-to-evict` is sufficient. If the ready condition is false, which means there is a quorum loss, then an operator should react to an alert and ascertain whether it is a transient or permanent quorum loss. If we set `unhealthyPodEvictionPolicy` to `AlwaysAllow`, then it can potentially result in the eviction of all etcd members, which we do not want.
PDB behavior and CA behavior are meant to tackle two entirely different problems. One aims to maintain the minimum desired replicas for multi-node etcds, in order to keep them quorate, irrespective of what state the underlying nodes are in. Hence PDBs are necessary for multi-node etcds, and CA does not adversely affect the quorum of a multi-node etcd. Now, setting the field `unhealthyPodEvictionPolicy` for the PDB here can easily lead to downtimes for the etcd, especially during node rollouts, because we would then allow the PDB to permit eviction beyond the `minAvailable` number of pods. This seems counter-intuitive.
On the other hand, CA aims to control the node lifecycle, including draining of nodes that run single-node etcds on them, and expects that an etcd pod should not block the draining of a node indefinitely. PDBs play no part for single-node etcds, since a single-node etcd pod must be rolled at some time or another, leading to a temporary unavailability of the etcd. Here, setting the `unhealthyPodEvictionPolicy` makes little difference, because at the end of the day, the main issue CA faces is that a healthy single-node etcd pod running on a node cannot be evicted because of the annotation `cluster-autoscaler.kubernetes.io/safe-to-evict` set on it, so the PDB has no part to play here. Instead, it should be the responsibility of the one who blocked the pod eviction (gardenlet) to unblock pod eviction during node drain, and then re-block it after the pod has been relocated to a new, already-rolled node.
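Concretely, the unblock/re-block described here amounts to toggling the annotation on the pod's metadata, roughly like this (pod name illustrative):

```yaml
# Sketch: the annotation as it appears on a single-node etcd pod.
# "Unblocking" means removing this annotation (or setting it to "true")
# before the drain; "re-blocking" means restoring it afterwards.
apiVersion: v1
kind: Pod
metadata:
  name: etcd-main-0 # illustrative
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```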
At the moment, it wouldn't be worth switching HA on for all `etcd-main`s that belong to non-eval/non-HA clusters (`etcd-events` would remain a singleton for non-HA clusters and doesn't run with the `safe-to-evict: false` annotation anyway on the regular worker pool). Here is data from some test landscapes:
Read 11186 ETCDs (of which 1882 have the purpose evaluation and are not HA; hibernated clusters were ignored).
Read 96 seeds.
Proposal "All but eval clusters become HA, so that the ETCD worker pool and the safe-to-evict annotation can be dropped": Replicas:
- ETCD Main: 3 replicas for HA and non-evaluation clusters, otherwise 1
- ETCD Events: 3 replicas for HA clusters, otherwise 1 (as today)
Proposal "All but eval clusters become HA, so that the ETCD worker pool and the safe-to-evict annotation can be dropped": Number of clusters with N replicas (today vs. proposed):
- ETCD Main:
- 1 replicas: 3871 today vs. 941 proposed requesting 1059 cores, 5.1T memory, 21K EUR today vs. 172 cores, 1.1T memory, 4.6K EUR proposed
- 3 replicas: 1722 today vs. 4652 proposed requesting 1768 cores, 7.9T memory, 33K EUR today vs. 4428 cores, 19.8T memory, 83K EUR proposed
- ETCD Events:
- 1 replicas: 3871 today vs. 3871 proposed requesting 372 cores, 2T memory, 8.6K EUR today vs. 372 cores, 2T memory, 8.6K EUR proposed
- 3 replicas: 1722 today vs. 1722 proposed requesting 407 cores, 2.2T memory, 9.3K EUR today vs. 407 cores, 2.2T memory, 9.3K EUR proposed
Proposal 1 "All but eval clusters become HA, so that the ETCD worker pool and the safe-to-evict annotation can be dropped": Status quo and predictions:
- ETCDs Today: 2827 cores, 12.9T memory, 54K EUR
- Pools Today: regular at 73% efficiency = 352K EUR and etcd at 56% efficiency = 104K EUR
- ETCDs Proposed: 4601 cores, 20.9T memory, 88K EUR
- Pools Proposed: regular at 72% efficiency = 478K EUR and etcd dropped
- Difference: +22K EUR
That's all guessing. I was assuming that the `etcd-main`s will be placed on the regular worker pool with the same requests-to-allocatable ratio as all the other pods on that pool. In some cases, the ratio may actually improve (to the advantage of the proposal), but then the predicted costs above neglect the daemonsets for the additional nodes (to the disadvantage of the proposal). Given that the difference is 22K (significant) and we haven't even counted disk and network costs, it seems cost-prohibitive to run `etcd-main` generally in HA for non-eval/non-HA clusters.
Then again, most ETCDs still have HVPA configured (HVPA is removed only for shoots on seed `canary/aws-ha-eu1`, and it would never scale down if marked productive) with extremely wasteful requests. When I rerun the calculation with the current VPA uncapped target recommendation:
Read 11186 ETCDs (of which 1882 have the purpose evaluation and are not HA; hibernated clusters were ignored).
Read 96 seeds.
Proposal 1 "All but eval clusters become HA, so that the ETCD worker pool and the safe-to-evict annotation can be dropped": Replicas:
- ETCD Main: 3 replicas for HA and non-evaluation clusters, otherwise 1
- ETCD Events: 3 replicas for HA clusters, otherwise 1 (as today)
Proposal "All but eval clusters become HA, so that the ETCD worker pool and the safe-to-evict annotation can be dropped": Number of clusters with N replicas (today vs. proposed):
- ETCD Main:
- 1 replicas: 3871 today vs. 941 proposed requesting 1059 cores, 5.1T memory, 21K EUR today vs. 79 cores, 663.5G memory, 2.8K EUR proposed
- 3 replicas: 1722 today vs. 4652 proposed requesting 1768 cores, 7.9T memory, 33K EUR today vs. 1495 cores, 13.7T memory, 58K EUR proposed
- ETCD Events:
- 1 replicas: 3871 today vs. 3871 proposed requesting 372 cores, 2T memory, 8.6K EUR today vs. 106 cores, 750.1G memory, 3.1K EUR proposed
- 3 replicas: 1722 today vs. 1722 proposed requesting 407 cores, 2.2T memory, 9.3K EUR today vs. 209 cores, 1.4T memory, 5.9K EUR proposed
Proposal "All but eval clusters become HA, so that the ETCD worker pool and the safe-to-evict annotation can be dropped": Status quo and predictions:
- ETCDs Today: 2827 cores, 12.9T memory, 54K EUR
- Pools Today: regular at 73% efficiency = 352K EUR and etcd at 56% efficiency = 104K EUR
- ETCDs Proposed: 1574 cores, 14.4T memory, 60K EUR
- Pools Proposed: regular at 72% efficiency = 431K EUR and etcd dropped
- Difference: -25K EUR
So, in this case, we would actually reduce our costs by 25K, but that's comparing apples to potatoes. Yes, even a single pod will prevent node scale-in, but realistically, we will require far fewer nodes in the ETCD worker pools with the new recommendations over time, because of bin-packing.
So, right now, we don't have the data/confidence to decide one way or another, i.e. at the moment we would neither decide to make all `etcd-main`s generally HA for non-eval/non-HA clusters, nor would we decide to invest in a more complex solution as depicted above.
What we can do: Go forward with the HVPA removal and clarify the downscaling question that @voelzmo will bring up:
- HA ETCDs can probably be safely downscaled at all times
- Non-HA ETCDs can probably be downscaled at least during the maintenance time window (even if marked productive)
Once we have that in, we can collect the data and run the prediction again. Then we have a proper foundation to decide what we are going to do with the ETCD worker pool and the `safe-to-evict: false` annotations.
WDYT?