Deion of the problem Extra volumes are not deleted during cl

Note: <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-ur

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

I located logs from 2019-05-20, as volumes listed in deion of this issue belongs

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Additional AWS volumes not deleted when using --destroy option when cluster name is longer than 21 characters about ocs-ci HOT 12 CLOSED

red-hat-storage commented on June 24, 2024

Additional AWS volumes not deleted when using --destroy option when cluster name is longer than 21 characters

from ocs-ci.

Comments (12)

mbukatov commented on June 24, 2024

Note: @dahorak seen this in CI today as well.

from ocs-ci.

clacroix12 commented on June 24, 2024

@mbukatov do you have logs for the run? We should be logging the volumes we found with the pattern we are using. Would be helpful to see if there's a disconnect between your cluster name and what we're using for pattern matching the volumes.

from ocs-ci.

mbukatov commented on June 24, 2024

@clacroix12 Do you mean logs from the installation or destroy run? I keep the logs from installation only. Edit: Ah, it should be good enough. I will locate and check the logs tomorrow.

from ocs-ci.

clacroix12 commented on June 24, 2024

@mbukatov also are you sure that those volumes are for this exact run? I just spun up a cluster and the name of my volumes includes the random id that we generate.

ocs-ci cmd:

python run.py --suite suites/ocs_basic_install.yml --cluster-name clacroix-test --no-email --no-destroy

worker instance names:

clacroix-test-65942-qbk7z-worker-us-east-2b-d2gfh
clacroix-test-65942-qbk7z-worker-us-east-2a-rz9tt
clacroix-test-65942-qbk7z-worker-us-east-2c-dds5h

extra volume names:

clacroix-test-65942-qbk7z-worker-us-east-2b-d2gfh_extra_volume
clacroix-test-65942-qbk7z-worker-us-east-2a-rz9tt_extra_volume
clacroix-test-65942-qbk7z-worker-us-east-2c-dds5h_extra_volume

Notice how both have 65942 after the name I provided with --cluster-name. Your extra volumes do not include this string, which makes me think they're actually not the volumes created with that installation.

I ran the same commands as you, only changing the provided cluster name, and all volumes were destroyed. I verified from my logs that the pattern being used to find volumes was clacroix-test-65942* and all three of my extra volumes were found.

destroy cmd:

python run.py --log-level DEBUG --cluster-path /tmp/clacroix-test-65942 --destroy

I was not able to reproduce this issue.

from ocs-ci.

nehaberry commented on June 24, 2024

@clacroix12 +1

The destroy cluster worked for me and deleted the 3 extra volumes as part of cleanup. Hence, even I was not able to reproduce this issue

This was the command I used to create cluster
python run.py --cluster-name=nberry-may17 --suite=suites/ocs_basic_install.yml --cluster-path=/home/nberry/aws-install/may21 --no-email --no-destroy

Command to destroy cluster
python run.py --destroy --cluster-path /home/nberry/aws-install/may21/

The extra volumes are deleted:

nberry-may21-dsltb-worker-us-east-2a-9ns2k_extra_volume
nberry-may21-dsltb-worker-us-east-2b-k55kt_extra_volume
nberry-may21-dsltb-worker-us-east-2c-7sr59_extra_volume

from ocs-ci.

dahorak commented on June 24, 2024

I have theory, why it doesn't work for me once ... but I wasn't able to install the cluster since then, so I wasn't able to reproduce it yet..
If I'll be able to reproduce it, in my case it will be probably connected with the length of the cluster name.

from ocs-ci.

dahorak commented on June 24, 2024

I was able to reproduce it and it and as I've write above, the reason in my case might be because of long cluster name:
The cluster is created using following command:

 python3 run.py --log-level DEBUG --suite suites/ocs_basic_install.yml --cluster-name ocs-ci-jenkins-dahorak-cluster --cluster-path /tmp/ocs-ci-jenkins-dahorak-cluster --no-destroy

And should be destroyed via this command:

python3 run.py --log-level DEBUG --cluster-path /tmp/ocs-ci-jenkins-dahorak-cluster --destroy

But in AWS I see, that it shrink the name of the cluster, so for example
instances are named this way:

ocs-ci-jenkins-dahora-bvxbp-master-0
ocs-ci-jenkins-dahora-bvxbp-master-1
ocs-ci-jenkins-dahora-bvxbp-master-2
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2a-kmvqj
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2b-fkcsw
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2c-xrcmk

And after the destroy command, following 3 volumes remain in the AWS:

ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2a-kmvqj_extra_volume  vol-03b9b83a1ef047e70 100 GiB gp2 300 May 22, 2019 at 10:43:30 AM UTC+2 us-east-2a  available None  Okay  Not Encrypted
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2b-fkcsw_extra_volume  vol-059392d8112372476 100 GiB gp2 300 May 22, 2019 at 10:43:18 AM UTC+2 us-east-2b  available None  Okay  Not Encrypted
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2c-xrcmk_extra_volume  vol-0fb20441fb036b461 100 GiB gp2 300 May 22, 2019 at 10:43:24 AM UTC+2 us-east-2c  available None

The full output of the destroy command is here:
https://gist.github.com/dahorak/59e1a4d8e7c31ac01773204f7855f6d4

from ocs-ci.

mbukatov commented on June 24, 2024

I located logs from 2019-05-20, as volumes listed in description of this issue belongs to cluster I created on on this date.

This can be demonstrated via:

$ cd cluster-2019-05-20/
$ grep mbukatov-ocs4-testing-fmc7p-worker-us-east-2a-9hr9r_extra_volume cephci-run-1558358218636/install_OCS_0.log | wc -l
15

And 1st line when creation of the volume is logged:

2019-05-20 15:51:30,404 - botocore.endpoint - DEBUG - Making request for OperationModel(name=CreateVolume) with params: {'url_path': '/', 'query_string': '', 'method': 'POST', 'headers': {'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8', 'User-Agent': 'Boto3/1.9.143 Python/3.7.3 Linux/5.0.13-300.fc30.x86_64 Botocore/1.12.143'}, 'body': {'Action': 'CreateVolume', 'Version': '2016-11-15', 'AvailabilityZone': 'us-east-2a', 'Encrypted': 'false', 'Size': 100, 'VolumeType': 'gp2', 'TagSpecification.1.ResourceType': 'volume', 'TagSpecification.1.Tag.1.Key': 'Name', 'TagSpecification.1.Tag.1.Value': 'mbukatov-ocs4-testing-fmc7p-worker-us-east-2a-9hr9r_extra_volume'}, 'url': 'https://ec2.us-east-2.amazonaws.com/', 'context': {'client_region': 'us-east-2', 'client_config': <botocore.config.Config object at 0x7f7c268154e0>, 'has_streaming_input': False, 'auth_type': None}}

Daniel's observation about long cluster name is good catch. As you can see, this applies to my case as well. Full name of my cluster was mbukatov-ocs4-testing-cluster-88528 as can be seen here:

2019-05-20 15:16:58,818 - utility.utils - INFO - Executing command: mkdir -p /tmp/mbukatov-ocs4-testing-cluster-88528

while the volume name prefix is just mbukatov-ocs4-testing.

It seems to me that we should delete the extra volumes using cluster_id instead of cluster_name, because that is what is actually used as a prefix in name of all AWS resources. We can also consider rejecting cluster name which is too long.

from ocs-ci.

clacroix12 commented on June 24, 2024

@mbukatov @dahorak It looks like either terraform or openshift-install is truncating the name if it's too long so this does actually create an issue if we're using the name we think the volumes will have to pattern match them. I like the idea of using the cluster_id instead. We could also limit the name length so we don't end up with a cluster directory name that doesn't match the corresponding cluster in aws. @petr-balogh thoughts?

from ocs-ci.

petr-balogh commented on June 24, 2024

@clacroix12 yep, I think that we should propagate this cluster_id from cluster directory to some ENV_DATA and we can use this cluster_id for deleting extra volumes which will solve the issue. About cluster name length I think we should also limit it to some length as we are also adding some random numbers CID to the name. Currently CID is int with length 5 digits, we can decrease it to 3 chars which will be containing letters and digits to have more combination.

from ocs-ci.

clacroix12 commented on June 24, 2024

@petr-balogh the purpose of the 5 digits we add to the cluster name was to ensure that the directory created was going to be unique. This is because by default all cluster directories are going to /tmp. However, if we started placing the cluster directory inside of the "run directory" that we create for logs / potentially other things for the entire automation run, we wouldn't need the 5 digit cid anymore. Since we would have a new parent directory for each individual automation run, we wouldn't need to worry about uniqueness at the cluster directory level.

Now cluster dir exists here:

/tmp/my-cluster-name-12345

Proposed solution:

/logs-ceph-mount-point/ocsci-jenkins/run-1559079670/my-cluster-name-12345

Note that run-1559079670 is an example of what our "run" directory would be. Basically run-<timestamp> or something similar, ensuring uniqueness between automation runs.

Long story short, we can cut the 5 digits out of the equation if we wanted to use the run directory as the parent for cluster directories. This obviously doesn't cover the case where the user provides a cluster_path at the CLI level, but we could always do something like enforce that the directory is already created before specifying it.

from ocs-ci.

vasukulkarni commented on June 24, 2024

should be fixed, old one.

from ocs-ci.

Additional AWS volumes not deleted when using --destroy option when cluster name is longer than 21 characters about ocs-ci HOT 12 CLOSED

Comments (12)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent