The terraform plugin supports defined related resources (for example, an NFS volume for


Cascading deletes using the terraform plugin (deploykit issue, closed)

kaufers commented on June 8, 2024

Cascading deletes using the terraform plugin

from deploykit.

Comments (3)

chungers commented on June 8, 2024

Per Issue #838 and PR #839, the leader node will be terminated as the very last node in the rolling update (export INFRAKIT_GROUP_POLICY_SELF_UPDATE=last, which is also the default behavior; see https://github.com/docker/infrakit/blob/master/pkg/run/v0/group/group.go#L70). Please verify this behavior.

This will address item 1 above. If item 1 is guaranteed, the next step is to ensure we can terminate the VM and all of its resources in a predictable way. The "self" node can shut down at any time due to the VM termination, so terraform apply could be mid-flight and could leave Terraform files on disk in a corrupted state.
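A minimal sketch of the "self last" ordering described in this comment, for illustration only. The function name `orderForUpdate` and the instance IDs are hypothetical, and this is not the code in infrakit's group.go; it only shows the idea of moving the node that runs the update to the end of the list.

```go
package main

import "fmt"

// orderForUpdate returns ids with self moved to the end, so the node running
// the update is the last one to be replaced. The relative order of the other
// nodes is preserved. (Hypothetical helper, not part of infrakit.)
func orderForUpdate(ids []string, self string) []string {
	ordered := make([]string, 0, len(ids))
	hasSelf := false
	for _, id := range ids {
		if id == self {
			hasSelf = true
			continue
		}
		ordered = append(ordered, id)
	}
	if hasSelf {
		ordered = append(ordered, self)
	}
	return ordered
}

func main() {
	// mgr1 is assumed to be the current leader ("self").
	fmt.Println(orderForUpdate([]string{"mgr1", "mgr2", "mgr3"}, "mgr1"))
	// prints: [mgr2 mgr3 mgr1]
}
```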


chungers commented on June 8, 2024

How can we delete the VM and its associated resources in a way that tolerates terraform apply being interrupted mid-flight because the self node is shut down?

Thinking through how Terraform works... I wonder if this can be done at all... If the self node is terminated as part of terraform apply, that process will just die mid-flight. Will this leave the terraform state files on disk in a corrupted state? If we know that terraform at least guarantees file / state consistency at the per-resource granularity, then we could do something with creating tombstones of the resources we need to delete:

1. Determine the list of resources that need to be terminated per instance destroy (the VM instance, the volumes).
2. Create a folder on disk for the 'delete' operation, for example delete-<timestamp>.
3. In this directory, create symlinks to all the files to be deleted.
4. At the top-level directory, change a symlink (e.g. delete-current) to point to this new directory.
5. After the symlink is created, start deleting every file linked from the delete-current directory.
6. Now call terraform apply. Terraform will start deleting resources and update its state file as it proceeds (or maybe wait for everything to be deleted and then 'commit').
7. The node running the terraform apply is terminated. Everything goes out.
8. At this point, the other running manager nodes detect that the current leader just went offline. A new round of leader election takes place and a new leader (an already updated node) takes over.
9. The new leader starts up.
10. The new leader looks at the terraform state files on its disk (which is a shared / global mount amongst the managers). It makes sure that all the symlinks in the delete-current directory point to no files. If any symlink still resolves (os.Readlink()), it should remove the linked file.
11. The new leader (its terraform plugin) now calls terraform apply again.
12. Terraform apply now runs on the new leader node and reconciles the infra resources with the on-disk files.
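The tombstone procedure above could be sketched as follows. This is only an illustration under assumptions: the function names (`markForDelete`, `finishDelete`, `runDemo`) and the `*.tf.json` file names are hypothetical, and the interrupted leader is simulated by simply not deleting the marked files. Note that `os.Readlink` only returns the stored link target, so the sketch uses `os.Lstat` to check whether the target still exists.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// markForDelete creates delete-<timestamp> under root, adds one symlink per
// file to be deleted, and then repoints delete-current at the new directory.
func markForDelete(root string, files []string) (string, error) {
	dir := filepath.Join(root, fmt.Sprintf("delete-%d", time.Now().UnixNano()))
	if err := os.Mkdir(dir, 0o755); err != nil {
		return "", err
	}
	for _, f := range files {
		abs, err := filepath.Abs(f)
		if err != nil {
			return "", err
		}
		if err := os.Symlink(abs, filepath.Join(dir, filepath.Base(f))); err != nil {
			return "", err
		}
	}
	// Create the link under a temporary name and rename it into place, so
	// delete-current never points at a half-populated directory.
	tmp := filepath.Join(root, "delete-current.tmp")
	_ = os.Remove(tmp)
	if err := os.Symlink(filepath.Base(dir), tmp); err != nil {
		return "", err
	}
	return dir, os.Rename(tmp, filepath.Join(root, "delete-current"))
}

// finishDelete is the recovery pass run by the new leader: any tombstone whose
// target still exists names a file the interrupted run did not remove.
func finishDelete(root string) (int, error) {
	current := filepath.Join(root, "delete-current")
	entries, err := os.ReadDir(current)
	if os.IsNotExist(err) {
		return 0, nil // nothing was ever marked
	}
	if err != nil {
		return 0, err
	}
	removed := 0
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(current, e.Name()))
		if err != nil {
			return removed, err
		}
		if _, err := os.Lstat(target); err == nil {
			if err := os.Remove(target); err != nil {
				return removed, err
			}
			removed++
		}
	}
	return removed, nil
}

// runDemo marks two of three files, "dies" before deleting them, and then
// lets the recovery pass finish. It returns (files removed, files remaining).
func runDemo() (int, int, error) {
	root, err := os.MkdirTemp("", "tf-state")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(root)
	var files []string
	for _, name := range []string{"instance-1.tf.json", "volume-1.tf.json", "instance-2.tf.json"} {
		p := filepath.Join(root, name)
		if err := os.WriteFile(p, []byte("{}"), 0o644); err != nil {
			return 0, 0, err
		}
		files = append(files, p)
	}
	if _, err := markForDelete(root, files[:2]); err != nil {
		return 0, 0, err
	}
	removed, err := finishDelete(root)
	if err != nil {
		return 0, 0, err
	}
	left, err := filepath.Glob(filepath.Join(root, "*.tf.json"))
	return removed, len(left), err
}

func main() {
	fmt.Println(runDemo()) // prints: 2 1 <nil>
}
```

The sketch covers only the files the plugin itself creates and deletes. Whether Terraform's own state file survives an interrupted apply is a separate question that the code does not address.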

The big assumption here is that any files that Terraform writes (its own state files, not the ones we create/delete) do not get corrupted mid-flight. This is a pretty big assumption. @kaufers, is there a way you can verify this?

If we don't want to make this assumption or don't trust what is said on the tin, then we would have to do something more coordinated. See my comments on #838.


kaufers commented on June 8, 2024

@chungers I think that what you have for #838 and #839 might actually solve this issue. Today, with the "resource" counting, we remove the "globally" scoped resource files when the last VM that references them is destroyed. In this case, that means that the terraform apply will include the destroy call for all of the resources (including the self VM).

In my testing on IBM Cloud, the resource destroy API call returns pretty quickly and there is a delay (up to a few minutes) before the actual VM is powered down. This provides plenty of time for all of the resources to be destroyed.

We hit issues when the manager group destroy deletes the current leader first. Once the updates that ensure destroy ordering are merged, I'll provide an update to this issue (there may no longer be problems).
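The counting described in this comment (remove a globally scoped resource only when the last VM that references it is destroyed) could be sketched like this. The type `refs`, its methods, and the resource and VM names are hypothetical and are not taken from the plugin's code.

```go
package main

import (
	"fmt"
	"sort"
)

// refs maps a globally scoped resource to the set of VMs that reference it.
// (Hypothetical structure for illustration.)
type refs map[string]map[string]bool

// add records that vm references resource.
func (r refs) add(resource, vm string) {
	if r[resource] == nil {
		r[resource] = map[string]bool{}
	}
	r[resource][vm] = true
}

// destroyVM drops vm from every resource and returns, in sorted order, the
// resources that no longer have any referrer and can be destroyed with it.
func (r refs) destroyVM(vm string) []string {
	var orphaned []string
	for resource, vms := range r {
		if !vms[vm] {
			continue
		}
		delete(vms, vm)
		if len(vms) == 0 {
			delete(r, resource)
			orphaned = append(orphaned, resource)
		}
	}
	sort.Strings(orphaned)
	return orphaned
}

func main() {
	r := refs{}
	r.add("nfs-volume", "mgr1")
	r.add("nfs-volume", "mgr2")
	fmt.Println(r.destroyVM("mgr1")) // prints: []
	fmt.Println(r.destroyVM("mgr2")) // prints: [nfs-volume]
}
```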
