
Comments (3)

chungers commented on June 8, 2024

Per Issue #838 and PR #839, the leader node is now terminated as the very last node in a rolling update. This is controlled by export INFRAKIT_GROUP_POLICY_SELF_UPDATE=last, which is also the default behavior (see https://github.com/docker/infrakit/blob/master/pkg/run/v0/group/group.go#L70). Please verify this behavior.

This addresses item 1 above. If item 1 is guaranteed, the next step is to ensure that we can terminate the VM and all of its resources in a predictable way. The "self" node can shut down at any time once its VM is terminated, so a terraform apply could be interrupted mid-flight, potentially leaving the Terraform files on disk in a corrupted state.

from deploykit.

chungers commented on June 8, 2024

How can we delete the VM and its associated resources in a way that tolerates terraform apply being interrupted mid-flight because the self node is shut down?

Thinking through how Terraform works... I wonder if this can be done at all... If the self node is terminated as part of terraform apply, that process will just die mid-flight. Will this leave the terraform state files on disk in a corrupted state? If we know that terraform at least guarantees file / state consistency at the per-resource granularity, then we could do something with creating tombstones of the resources we need to delete:

  1. Determine the list of resources that need to be terminated for each instance destroy (the VM instance, the volumes).
  2. Create a folder on disk for the 'delete' operation, for example delete-<timestamp>.
  3. In this directory, create symlinks to all the files to be deleted.
  4. At the top-level directory, change a symlink (e.g. delete-current) to point to this new directory.
  5. After the symlink is updated, start deleting every file referenced from the delete-current directory.
  6. Now call terraform apply. Terraform will start deleting resources and update its state file as it proceeds (or maybe wait for everything to be deleted and then 'commit').
  7. The node running the terraform apply is terminated. Everything goes out.
  8. At this point, the other running manager nodes detect that the current leader just went offline. A new round of leader election takes place and a new leader (an already-updated node) takes over.
  9. The new leader starts up.
  10. The new leader looks at the terraform state files on its disk (which is a shared / global mount amongst the managers). It makes sure that none of the symlinks in the delete-current directory point to an existing file. If a symlink's target (read with os.Readlink()) still exists, it should remove the linked file.
  11. The new leader (its terraform plugin) now calls terraform apply again.
  12. Terraform apply now runs on the new leader node and reconciles the infra resources with the on-disk files.

The big assumption here is that any files that Terraform writes (its own state files, not the ones we create/delete) do not get corrupted mid-flight. This is a pretty big assumption. Is there a way you can verify this, @kaufers?

If we don't want to make this assumption, or don't trust what is said on the tin, then we would have to do something more coordinated. See my comments on #838.


kaufers commented on June 8, 2024

@chungers I think that what you have for #838 and #839 might actually solve this issue. Today, with the "resource" counting, we remove the "globally" scoped resource files when the last VM that references them is destroyed. In this case, that means that the terraform apply will include the destroy call for all of the resources (including the self VM).

In my testing on IBM Cloud, the resource destroy API call returns pretty quickly and there is a delay (up to a few minutes) before the actual VM is powered down. This provides plenty of time for all of the resources to be destroyed.

We hit issues when the manager group destroy deletes the current leader first. Once the updates that ensure destroy ordering are merged, I'll provide an update to this issue (there may no longer be problems).
