
Comments (4)

chungers commented on June 8, 2024

Without any additional changes to the enrollment controller... can we run two instances of the enrollment controller, one as SWARM-GRP-READY => SWARM-INST-ALL and the other as SWARM-GRP-READY => VM-INST-ALL? Not worrying about excessive polling for a second, and supposing that we add a removal policy, wouldn't these two combinations deal with all the cases?

If this is workable, we can avoid excessive polling by adding a caching instance plugin similar to how TF was implemented... this can then query the backend only at X-second intervals, and any queries between samples will just use cached values.
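Roughly, such a caching wrapper might look like the sketch below (illustrative only -- the real instance plugin interface is richer than a single query method):

package enroller

import (
    "sync"
    "time"
)

// cachedDescriber wraps an expensive instance query and hits the backend at
// most once per interval; calls in between return the cached result.
type cachedDescriber struct {
    mu       sync.Mutex
    interval time.Duration             // X: minimum time between real queries
    last     time.Time                 // when the backend was last queried
    cached   []string                  // last result (instance IDs, for illustration)
    describe func() ([]string, error)  // the real, expensive query
}

func (c *cachedDescriber) Describe() ([]string, error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.cached != nil && time.Since(c.last) < c.interval {
        return c.cached, nil // between samples: serve cached values
    }
    result, err := c.describe()
    if err != nil {
        return nil, err
    }
    c.cached, c.last = result, time.Now()
    return result, nil
}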

Then we need to handle timeouts. We can add a policy to the enrollment controller... something that says RemoveAfterAdditionalAbsences=N -- where N defaults to 0, meaning that a Destroy is issued on the first absence (0 additional times). An N > 0 implies N*polling_interval of time tolerance from the first absence of an entry. The implementation is a bit tricky because we need to reset a counter keyed by the instance entry as soon as the entry comes back in the group's entries... but it seems pretty mechanical and generic.
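A minimal sketch of that counter, assuming the controller already has the set of enrolled entries and the set of entries seen in the latest poll (names are illustrative):

package enroller

// absenceTracker implements RemoveAfterAdditionalAbsences=N: an entry is
// destroyed only after it has been absent from N+1 consecutive polls.
type absenceTracker struct {
    tolerance int            // N: additional polls to tolerate after the first absence
    misses    map[string]int // consecutive absences, keyed by instance entry
}

// reconcile returns the entries that should be destroyed after this poll.
func (t *absenceTracker) reconcile(enrolled, seen map[string]bool) []string {
    if t.misses == nil {
        t.misses = map[string]int{}
    }
    var destroy []string
    for key := range enrolled {
        if seen[key] {
            delete(t.misses, key) // entry came back: reset its counter
            continue
        }
        t.misses[key]++
        if t.misses[key] > t.tolerance { // N=0: destroy on the first absence
            destroy = append(destroy, key)
            delete(t.misses, key)
        }
    }
    return destroy
}

Keying the counter by the instance entry and resetting it whenever the entry reappears gives the N*polling_interval tolerance described above.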

Am I missing anything?


kaufers commented on June 8, 2024

The problem is that the different scenarios are not unique. For example:

SWARM-GRP-READY=>SWARM-INST-ALL

Scenarios 1 (orphan) and 3 (node down) present the same:

SWARM-GRP-READY   SWARM-INST-ALL
n1-link1          n1 (ready)
                  n2 (down)
n3-link3          n3 (ready)

SWARM-GRP-READY=>VM-INST-ALL

Scenarios 2 (join failure) and 3 (node down) present the same:

SWARM-GRP-READY   VM-INST-ALL
n1-link1          n1-link1
                  n2-link2
n3-link3          n3-link3

In this case, whatever timeout value we assign ends up handling both scenarios, since there is no way to tell them apart.

It seems like we need the data from both VM-INST-ALL and SWARM-INST-ALL in order to uniquely identify the scenario.

I'm wondering how generic a problem this really is (in other words, do we need to use the generic enroller to handle this?). Couldn't we create a SwarmConsistency controller that has timeout values for the different scenarios?

The logic would be something like:

for node in union(VM-INST-ALL, SWARM-INST-ALL):
  if node in SWARM-GRP-READY:
    continue  // healthy node
  if node in SWARM-INST-ALL and node not in VM-INST-ALL:
    ...do orphan logic...
  else if node not in SWARM-INST-ALL and node in VM-INST-ALL:
    ...do join failure logic...
  else:
    ...do node down logic...

With this solution we don't need the group/instance plugins for the swarm nodes. The new controller could just get the nodes via the docker client directly. Thoughts?
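For the node-fetching piece, a rough sketch using the Docker Go client (treat the exact API as illustrative -- constructor and option names vary across client versions):

package enroller

import (
    "context"

    "github.com/docker/docker/api/types"
    "github.com/docker/docker/api/types/swarm"
    "github.com/docker/docker/client"
)

// swarmNodeSets returns the set of all swarm nodes and the subset that
// reports ready, keyed by hostname.
func swarmNodeSets(ctx context.Context) (all, ready map[string]bool, err error) {
    cli, err := client.NewClientWithOpts(client.FromEnv)
    if err != nil {
        return nil, nil, err
    }
    nodes, err := cli.NodeList(ctx, types.NodeListOptions{})
    if err != nil {
        return nil, nil, err
    }
    all, ready = map[string]bool{}, map[string]bool{}
    for _, n := range nodes {
        all[n.Description.Hostname] = true
        if n.Status.State == swarm.NodeStateReady {
            ready[n.Description.Hostname] = true
        }
    }
    return all, ready, nil
}

That covers the swarm side of the union; VM-INST-ALL would presumably still come from the VM instance plugin.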


chungers commented on June 8, 2024

The rogue node case is tricky... The most we can do is to force docker node rm, but I don't think we can safely delete any instances, nor do we want to. This is because there is no way to tell that an instance coming back from the Instance plugin without a link tag can ever be safely or correctly correlated to a swarm node. If you have two such instances then there's no way to know which one is really running the docker engine... the only way is to connect to the instance and see if the engine returns an id that matches any of the entries in the swarm.

If all we can do in this case is to force docker node rm, then we are effectively reducing the capacity of the cluster, because as far as infrakit is concerned, the group has the size that was specified at some point. So docker node rm will not cause additional nodes to be provisioned. I think if these rogue nodes do appear -- which is entirely possible because provisioning can always be done manually or through other means -- we should leave them alone. Or, at the very least, removing them as swarm nodes should be a matter of policy...


kaufers commented on June 8, 2024

The most we can do is to force docker node rm, but I don't think we can safely delete any instances, nor do we want to.

I agree, if anything we'd just remove them from the swarm. And you're right, if we have 2 nodes that have the same link ID then we have no way to know which is the "correct" node.

I was more thinking of the case where someone followed the UCP steps and ran the docker join command on another system that Infrakit is not managing.

Or, at the very least, removing them as swarm nodes should be a matter of policy...

Yeah, I think that this is really a corner case and that we can address it later (if at all).

