
Comments (4)

chungers commented on June 8, 2024

Without any additional changes to the enrollment controller... can we run two instances of the enrollment controller, one as SWARM-GRP-READY => SWARM-INST-ALL and the other as SWARM-GRP-READY => VM-INST-ALL? Not worrying about excessive polling for a second, and supposing that we add a removal policy, wouldn't these two combinations deal with all the cases?

If this is workable, we can avoid excessive polling by adding a caching instance plugin similar to how TF was implemented... this can then query the backend only at X-second intervals, and any queries between samples will just use cached values.
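Roughly, such a caching wrapper might look like the sketch below (illustrative only -- the real instance plugin interface is richer than a single query method):

package enroller

import (
    "sync"
    "time"
)

// cachedDescriber wraps an expensive instance query and hits the backend at
// most once per interval; calls in between return the cached result.
type cachedDescriber struct {
    mu       sync.Mutex
    interval time.Duration             // X: minimum time between real queries
    last     time.Time                 // when the backend was last queried
    cached   []string                  // last result (instance IDs, for illustration)
    describe func() ([]string, error)  // the real, expensive query
}

func (c *cachedDescriber) Describe() ([]string, error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.cached != nil && time.Since(c.last) < c.interval {
        return c.cached, nil // between samples: serve cached values
    }
    result, err := c.describe()
    if err != nil {
        return nil, err
    }
    c.cached, c.last = result, time.Now()
    return result, nil
}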

Then we need to handle timeouts. We can add a policy to the enrollment controller... something that says RemoveAfterAdditionalAbsences=N -- where N defaults to 0, meaning that a Destroy is issued on the first absence (0 additional times). An N > 0 implies N*polling_interval of time tolerance from the first absence of an entry. The implementation is a bit tricky because we need to reset a counter keyed by the instance entry as soon as the entry comes back in the group's entries... but it seems pretty mechanical and generic.
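A minimal sketch of that counter, assuming the controller already has the set of enrolled entries and the set of entries seen in the latest poll (names are illustrative):

package enroller

// absenceTracker implements RemoveAfterAdditionalAbsences=N: an entry is
// destroyed only after it has been absent from N+1 consecutive polls.
type absenceTracker struct {
    tolerance int            // N: additional polls to tolerate after the first absence
    misses    map[string]int // consecutive absences, keyed by instance entry
}

// reconcile returns the entries that should be destroyed after this poll.
func (t *absenceTracker) reconcile(enrolled, seen map[string]bool) []string {
    if t.misses == nil {
        t.misses = map[string]int{}
    }
    var destroy []string
    for key := range enrolled {
        if seen[key] {
            delete(t.misses, key) // entry came back: reset its counter
            continue
        }
        t.misses[key]++
        if t.misses[key] > t.tolerance { // N=0: destroy on the first absence
            destroy = append(destroy, key)
            delete(t.misses, key)
        }
    }
    return destroy
}

Keying the counter by the instance entry and resetting it whenever the entry reappears gives the N*polling_interval tolerance described above.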

Am I missing anything?


kaufers commented on June 8, 2024

The problem is that the different scenarios are not unique. For example:

SWARM-GRP-READY=>SWARM-INST-ALL

Scenarios 1 (orphan) and 3 (node down) present the same:

SWARM-GRP-READY   SWARM-INST-ALL
n1-link1          n1 (ready)
                  n2 (down)
n3-link3          n3 (ready)

SWARM-GRP-READY=>VM-INST-ALL

Scenarios 2 (join failure) and 3 (node down) present the same:

SWARM-GRP-READY   VM-INST-ALL
n1-link1          n1-link1
                  n2-link2
n3-link3          n3-link3

In this case, whatever timeout value we assign ends up handling both scenarios, since there is no way to tell them apart.

It seems like we need the data from both VM-INST-ALL and SWARM-INST-ALL in order to uniquely identify the scenario.

I'm wondering how generic a problem this really is (in other words, do we need to use the generic enroller to handle this?). Couldn't we create a SwarmConsistency controller that has timeout values for the different scenarios?

The logic would be something like:

for node in union(VM-INST-ALL, SWARM-INST-ALL):
  if node in SWARM-GRP-READY:
    continue  // healthy node
  if node in SWARM-INST-ALL and node not in VM-INST-ALL:
    ...do orphan logic...
  else if node not in SWARM-INST-ALL and node in VM-INST-ALL:
    ...do join failure logic...
  else:
    ...do node down logic...

With this solution we don't need the group/instance plugins for the swarm nodes. The new controller could just get the nodes via the docker client directly. Thoughts?
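For the node-fetching piece, a rough sketch using the Docker Go client (treat the exact API as illustrative -- constructor and option names vary across client versions):

package enroller

import (
    "context"

    "github.com/docker/docker/api/types"
    "github.com/docker/docker/api/types/swarm"
    "github.com/docker/docker/client"
)

// swarmNodeSets returns the set of all swarm nodes and the subset that
// reports ready, keyed by hostname.
func swarmNodeSets(ctx context.Context) (all, ready map[string]bool, err error) {
    cli, err := client.NewClientWithOpts(client.FromEnv)
    if err != nil {
        return nil, nil, err
    }
    nodes, err := cli.NodeList(ctx, types.NodeListOptions{})
    if err != nil {
        return nil, nil, err
    }
    all, ready = map[string]bool{}, map[string]bool{}
    for _, n := range nodes {
        all[n.Description.Hostname] = true
        if n.Status.State == swarm.NodeStateReady {
            ready[n.Description.Hostname] = true
        }
    }
    return all, ready, nil
}

That covers the swarm side of the union; VM-INST-ALL would presumably still come from the VM instance plugin.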


chungers commented on June 8, 2024

The rogue node case is tricky... The most we can do is to force docker node rm, but I don't think we can safely delete any instances, nor do we want to. This is because there is no way to tell that an instance coming back from the Instance plugin without a link tag can ever be safely or correctly correlated to a swarm node. If you have two such instances then there's no way to know which one is really running the docker engine... the only way is to connect to the instance and see if the engine returns an id that matches any of the entries in the swarm.

If all we can do in this case is to force docker node rm, then we are effectively reducing the capacity of the cluster, because as far as infrakit is concerned, the group has the size that was specified at some point. So docker node rm will not cause additional nodes to be provisioned. I think if these rogue nodes do appear -- which is entirely possible because provisioning can always be done manually or through other means -- we should leave them alone. Or, at the very least, removing them as swarm nodes should be a matter of policy...


kaufers commented on June 8, 2024

The most we can do is to force docker node rm, but I don't think we can safely delete any instances, nor do we want to.

I agree, if anything we'd just remove them from the swarm. And you're right, if we have 2 nodes that have the same link ID then we have no way to know which is the "correct" node.

I was more thinking of the case where someone followed the UCP steps and ran the docker join command on another system that Infrakit is not managing.

Or, at the very least, removing them as swarm nodes should be a matter of policy...

Yeah, I think that this is really a corner case and that we can address it later (if at all).

