Comments (12)
Note: @dahorak seen this in CI today as well.
from ocs-ci.
@mbukatov do you have logs for the run? We should be logging the volumes we found with the pattern we are using. Would be helpful to see if there's a disconnect between your cluster name and what we're using for pattern matching the volumes.
from ocs-ci.
@clacroix12 Do you mean logs from the installation or destroy run? I keep the logs from installation only. Edit: Ah, it should be good enough. I will locate and check the logs tomorrow.
from ocs-ci.
@mbukatov also are you sure that those volumes are for this exact run? I just spun up a cluster and the name of my volumes includes the random id that we generate.
ocs-ci cmd:
python run.py --suite suites/ocs_basic_install.yml --cluster-name clacroix-test --no-email --no-destroy
worker instance names:
clacroix-test-65942-qbk7z-worker-us-east-2b-d2gfh
clacroix-test-65942-qbk7z-worker-us-east-2a-rz9tt
clacroix-test-65942-qbk7z-worker-us-east-2c-dds5h
extra volume names:
clacroix-test-65942-qbk7z-worker-us-east-2b-d2gfh_extra_volume
clacroix-test-65942-qbk7z-worker-us-east-2a-rz9tt_extra_volume
clacroix-test-65942-qbk7z-worker-us-east-2c-dds5h_extra_volume
Notice how both have 65942
after the name I provided with --cluster-name
. Your extra volumes do not include this string, which makes me think they're actually not the volumes created with that installation.
I ran the same commands as you, only changing the provided cluster name, and all volumes were destroyed. I verified from my logs that the pattern being used to find volumes was clacroix-test-65942*
and all three of my extra volumes were found.
destroy cmd:
python run.py --log-level DEBUG --cluster-path /tmp/clacroix-test-65942 --destroy
I was not able to reproduce this issue.
from ocs-ci.
@clacroix12 +1
The destroy cluster worked for me and deleted the 3 extra volumes as part of cleanup. Hence, even I was not able to reproduce this issue
This was the command I used to create cluster
python run.py --cluster-name=nberry-may17 --suite=suites/ocs_basic_install.yml --cluster-path=/home/nberry/aws-install/may21 --no-email --no-destroy
Command to destroy cluster
python run.py --destroy --cluster-path /home/nberry/aws-install/may21/
The extra volumes are deleted:
nberry-may21-dsltb-worker-us-east-2a-9ns2k_extra_volume
nberry-may21-dsltb-worker-us-east-2b-k55kt_extra_volume
nberry-may21-dsltb-worker-us-east-2c-7sr59_extra_volume
from ocs-ci.
I have theory, why it doesn't work for me once ... but I wasn't able to install the cluster since then, so I wasn't able to reproduce it yet..
If I'll be able to reproduce it, in my case it will be probably connected with the length of the cluster name.
from ocs-ci.
I was able to reproduce it and it and as I've write above, the reason in my case might be because of long cluster name:
The cluster is created using following command:
python3 run.py --log-level DEBUG --suite suites/ocs_basic_install.yml --cluster-name ocs-ci-jenkins-dahorak-cluster --cluster-path /tmp/ocs-ci-jenkins-dahorak-cluster --no-destroy
And should be destroyed via this command:
python3 run.py --log-level DEBUG --cluster-path /tmp/ocs-ci-jenkins-dahorak-cluster --destroy
But in AWS I see, that it shrink the name of the cluster, so for example
instances are named this way:
ocs-ci-jenkins-dahora-bvxbp-master-0
ocs-ci-jenkins-dahora-bvxbp-master-1
ocs-ci-jenkins-dahora-bvxbp-master-2
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2a-kmvqj
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2b-fkcsw
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2c-xrcmk
And after the destroy command, following 3 volumes remain in the AWS:
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2a-kmvqj_extra_volume vol-03b9b83a1ef047e70 100 GiB gp2 300 May 22, 2019 at 10:43:30 AM UTC+2 us-east-2a available None Okay Not Encrypted
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2b-fkcsw_extra_volume vol-059392d8112372476 100 GiB gp2 300 May 22, 2019 at 10:43:18 AM UTC+2 us-east-2b available None Okay Not Encrypted
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2c-xrcmk_extra_volume vol-0fb20441fb036b461 100 GiB gp2 300 May 22, 2019 at 10:43:24 AM UTC+2 us-east-2c available None
The full output of the destroy command is here:
https://gist.github.com/dahorak/59e1a4d8e7c31ac01773204f7855f6d4
from ocs-ci.
I located logs from 2019-05-20, as volumes listed in description of this issue belongs to cluster I created on on this date.
This can be demonstrated via:
$ cd cluster-2019-05-20/
$ grep mbukatov-ocs4-testing-fmc7p-worker-us-east-2a-9hr9r_extra_volume cephci-run-1558358218636/install_OCS_0.log | wc -l
15
And 1st line when creation of the volume is logged:
2019-05-20 15:51:30,404 - botocore.endpoint - DEBUG - Making request for OperationModel(name=CreateVolume) with params: {'url_path': '/', 'query_string': '', 'method': 'POST', 'headers': {'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8', 'User-Agent': 'Boto3/1.9.143 Python/3.7.3 Linux/5.0.13-300.fc30.x86_64 Botocore/1.12.143'}, 'body': {'Action': 'CreateVolume', 'Version': '2016-11-15', 'AvailabilityZone': 'us-east-2a', 'Encrypted': 'false', 'Size': 100, 'VolumeType': 'gp2', 'TagSpecification.1.ResourceType': 'volume', 'TagSpecification.1.Tag.1.Key': 'Name', 'TagSpecification.1.Tag.1.Value': 'mbukatov-ocs4-testing-fmc7p-worker-us-east-2a-9hr9r_extra_volume'}, 'url': 'https://ec2.us-east-2.amazonaws.com/', 'context': {'client_region': 'us-east-2', 'client_config': <botocore.config.Config object at 0x7f7c268154e0>, 'has_streaming_input': False, 'auth_type': None}}
Daniel's observation about long cluster name is good catch. As you can see, this applies to my case as well. Full name of my cluster was mbukatov-ocs4-testing-cluster-88528
as can be seen here:
2019-05-20 15:16:58,818 - utility.utils - INFO - Executing command: mkdir -p /tmp/mbukatov-ocs4-testing-cluster-88528
while the volume name prefix is just mbukatov-ocs4-testing
.
It seems to me that we should delete the extra volumes using cluster_id
instead of cluster_name
, because that is what is actually used as a prefix in name of all AWS resources. We can also consider rejecting cluster name which is too long.
from ocs-ci.
@mbukatov @dahorak It looks like either terraform or openshift-install is truncating the name if it's too long so this does actually create an issue if we're using the name we think the volumes will have to pattern match them. I like the idea of using the cluster_id
instead. We could also limit the name length so we don't end up with a cluster directory name that doesn't match the corresponding cluster in aws. @petr-balogh thoughts?
from ocs-ci.
@clacroix12 yep, I think that we should propagate this cluster_id from cluster directory to some ENV_DATA and we can use this cluster_id for deleting extra volumes which will solve the issue. About cluster name length I think we should also limit it to some length as we are also adding some random numbers CID to the name. Currently CID is int with length 5 digits, we can decrease it to 3 chars which will be containing letters and digits to have more combination.
from ocs-ci.
@petr-balogh the purpose of the 5 digits we add to the cluster name was to ensure that the directory created was going to be unique. This is because by default all cluster directories are going to /tmp
. However, if we started placing the cluster directory inside of the "run directory" that we create for logs / potentially other things for the entire automation run, we wouldn't need the 5 digit cid anymore. Since we would have a new parent directory for each individual automation run, we wouldn't need to worry about uniqueness at the cluster directory level.
Now cluster dir exists here:
/tmp/my-cluster-name-12345
Proposed solution:
/logs-ceph-mount-point/ocsci-jenkins/run-1559079670/my-cluster-name-12345
Note that run-1559079670
is an example of what our "run" directory would be. Basically run-<timestamp>
or something similar, ensuring uniqueness between automation runs.
Long story short, we can cut the 5 digits out of the equation if we wanted to use the run directory as the parent for cluster directories. This obviously doesn't cover the case where the user provides a cluster_path
at the CLI level, but we could always do something like enforce that the directory is already created before specifying it.
from ocs-ci.
should be fixed, old one.
from ocs-ci.
Related Issues (20)
- [Tier1] Tests fails on setup waiting for aws resource creation. HOT 2
- 1AZ deployment if failing with InsufficientInstanceCapacity
- test_pvc_delete_subvolumegroup fails during setup phase while trying to delete registry PVC
- test_pvc_delete_subvolumegroup fails on ODF 4.15
- Boto3-related issue #9250 requires a permanent solution
- Add the IBMCloud Bare Metal login credentials HOT 2
- Install the ibmcloud "sl" plugin before running the HCI Provider tests HOT 5
- [UI]: ODF 4.15 UI deployment is ailing
- Add support to choose performance profiles during cluster deployments
- test_deletion_sync_after_node_restart[noobaa-core] is failing on IBM Power HOT 2
- test_noobaa_db_backup_recovery_locally is failing on ODF 4.15 on IBM Power HOT 2
- AWS UPI deployment failure HOT 2
- managedserviceaccount-preview needs to be changed to managedserviceaccount since its not tp anymore in acm 2.9
- exec_cmd produces the typeError
- Proper Handling required for oc subcmd string check
- test_cluster_full_and_recovery failing with TypeError: __init__() missing 1 required positional argument: 'value' HOT 2
- `test_log_based_bucket_replication.py` fails in disconnected clusters
- test_capacity_breakdown_ui - add support for versions above 4.8
- Once BZ: 2255557 is backported, remove W/A for labeling master nodes with region for IBM Cloud deployment HOT 1
- Update retrieve mcg-cli to get it from new container shipped to customers
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from ocs-ci.