Comments (5)
@ejweber I've tested it on v1.5.x-head, but the behavior is a little different from #8305 (comment). If the storage-network setting is set to longhorn-system/notexist, instance-manager gets stuck in the ContainerCreating state instead of the Running state. Could you help confirm whether this is expected behavior? Thank you!
from longhorn.
Pre Ready-For-Testing Checklist
- Where are the reproduce steps/test steps documented?
To reproduce in v1.5.x before fix:
- Ensure no volumes are attached.
- Set the storage-network setting to a random value (e.g. longhorn-system/notexist).
- Wait for the instance-manager pods to be recreated (as appropriate).
- Delete all longhorn-manager pods.
- Observe instance-manager pods being recreated again (for no reason, because the storage-network setting synced).
eweber@laptop:~/longhorn> kl delete pod -l app=longhorn-manager
pod "longhorn-manager-cl472" deleted
pod "longhorn-manager-cnfkv" deleted
pod "longhorn-manager-pdtj2" deleted
eweber@laptop:~/longhorn> kl get pod | grep instance-manager
instance-manager-e-bbd0405d8fa87cc0209520b4c3262577 0/1 ContainerCreating 0 2s
instance-manager-e-c84e8856027c5474944ab4efd990f514 0/1 ContainerCreating 0 2s
instance-manager-e-e2c3e593061b7d2b3dbc3d36d2b2290a 0/1 Terminating 0 3s
instance-manager-r-bbd0405d8fa87cc0209520b4c3262577 0/1 Terminating 0 3s
instance-manager-r-c84e8856027c5474944ab4efd990f514 0/1 ContainerCreating 0 2s
instance-manager-r-e2c3e593061b7d2b3dbc3d36d2b2290a 0/1 Terminating 0 3s
To test the backported fix in v1.5.x:
- Observe instance-manager pods are not recreated again.
eweber@laptop:~/longhorn> kl delete pod -l app=longhorn-manager
pod "longhorn-manager-cl472" deleted
pod "longhorn-manager-cnfkv" deleted
pod "longhorn-manager-pdtj2" deleted
eweber@laptop:~/longhorn> kl get pod | grep instance-manager
instance-manager-e-4b443d2688949e932fe861a5e08f2a42 1/1 Running 0 5m23s
instance-manager-e-5b946f5536c20defcce3ba51560a1dee 1/1 Running 0 4m53s
instance-manager-e-5d61ec2e1803a0a52a407f466c402633 1/1 Running 0 5m5s
instance-manager-r-4b443d2688949e932fe861a5e08f2a42 1/1 Running 0 5m23s
instance-manager-r-5b946f5536c20defcce3ba51560a1dee 1/1 Running 0 4m54s
instance-manager-r-5d61ec2e1803a0a52a407f466c402633 1/1 Running 0 5m5s
- Is there a workaround for the issue? If so, where is it documented?
Once a volume attaches, pods are no longer erroneously recreated.
- Does the PR include the explanation for the fix or the feature?
- Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
The PR is at: longhorn/longhorn-manager#2720.
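The two pod listings in the test steps above differ in status and age, so the "recreated or not" check can be done mechanically. The following is only a rough sketch, not part of Longhorn or its test suite: the function names and the 60-second threshold are assumptions, and it expects the default column layout of `kubectl get pod` (NAME, READY, STATUS, RESTARTS, AGE).

```python
import re

# Matches one component of a kubectl AGE value, e.g. "5m" and "23s" in "5m23s".
_AGE_PART = re.compile(r"(\d+)([dhms])")
_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def age_seconds(age):
    """Convert a kubectl AGE string such as '2s' or '5m23s' to seconds."""
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in _AGE_PART.findall(age))


def recently_recreated(listing, max_age_s=60):
    """Return instance-manager pods that are not Running or are younger than max_age_s.

    max_age_s is a hypothetical threshold; pick one shorter than the time
    since the longhorn-manager pods were deleted.
    """
    flagged = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 5 or not fields[0].startswith("instance-manager"):
            continue
        # NAME is first, STATUS is third, AGE is always the last column.
        name, status, age = fields[0], fields[2], fields[-1]
        if status != "Running" or age_seconds(age) < max_age_s:
            flagged.append(name)
    return flagged
```

Feeding it the output of the pod listing command shortly after deleting the longhorn-manager pods should return every instance-manager pod before the fix and an empty list after it.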
from longhorn.
@yangchiu, this is not the behavior I observe in my cluster, but it may make sense.
My cluster does not have Multus or the NetworkAttachmentDefinition CRD installed. So, Longhorn sets the annotation on the instance-manager pod, but no component in the cluster actually attempts to set up a secondary network.
k8s.v1.cni.cncf.io/networks: '[{"namespace": "longhorn-system", "name": "notexist", "interface": "lhnet1"}]'
This is fine for testing, as I only want to verify whether setting the Longhorn storage-network setting causes instance-manager pods to be continuously restarted.
Do you have Multus installed in your test cluster? If so, I think it is likely that the container fails to start because there is no longhorn-system/notexist CRD in the cluster, leading to a failure in the network creation for the instance-manager pods.
If you DO have Multus installed, can we test by EITHER:
- Using a cluster that does NOT have Multus installed.
- Changing the storage-network setting to match an actual NetworkAttachmentDefinition that works in the cluster instead of longhorn-system/notexist.
If you DON'T have Multus installed, can you please send a support bundle for evaluation?
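For reference, the annotation value quoted above follows directly from the `<namespace>/<name>` form of the storage-network setting. The sketch below only illustrates that mapping; it is not Longhorn's actual code, the function name is hypothetical, and the `lhnet1` interface name is taken from the annotation shown in this comment.

```python
import json


def storage_network_annotation(setting_value, interface="lhnet1"):
    """Build the k8s.v1.cni.cncf.io/networks value for a '<namespace>/<name>' setting."""
    namespace, sep, name = setting_value.partition("/")
    if not sep or not namespace or not name or "/" in name:
        raise ValueError(f"expected '<namespace>/<name>', got {setting_value!r}")
    # One entry per secondary network; Multus reads this JSON list from the pod annotation.
    return json.dumps([{"namespace": namespace, "name": name, "interface": interface}])


print(storage_network_annotation("longhorn-system/notexist"))
```

Nothing here checks that the referenced NetworkAttachmentDefinition exists, which matches the situation described in this comment: the annotation is set regardless, and only a component such as Multus acts on it.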
from longhorn.
@yangchiu, this is not the behavior I observe in my cluster, but it may make sense.
My cluster does not have Multus or the NetworkAttachmentDefinition CRD installed. So, Longhorn sets the annotation on the instance-manager pod, but no component in the cluster actually attempts to set up a secondary network.
k8s.v1.cni.cncf.io/networks: '[{"namespace": "longhorn-system", "name": "notexist", "interface": "lhnet1"}]'
This is fine for testing, as I only want to verify whether setting the Longhorn storage-network setting causes instance-manager pods to be continuously restarted.
Do you have Multus installed in your test cluster? If so, I think it is likely that the container fails to start because there is no longhorn-system/notexist CRD in the cluster, leading to a failure in the network creation for the instance-manager pods.
Yes, I tested with Multus installed in my cluster. Instance managers get stuck in ContainerCreating with the error message:
Warning FailedCreatePodSandBox 4s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "06f9c589bde0ff43f6673441968b2dde448d465b40c59c93191e4f78c134c1eb": plugin type="multus" failed (add): Multus: [longhorn-system/instance-manager-d6ec2eb00e44dab155201d755760d661/5af79f42-d03a-4dfd-921c-33b766622adf]: error loading k8s delegates k8s args: TryLoadPodDelegates: error in getting k8s network for pod: GetNetworkDelegates: failed getting the delegate: getKubernetesDelegate: cannot find a network-attachment-definition (notexist) in namespace (longhorn-system): network-attachment-definitions.k8s.cni.cncf.io "notexist" not found
So it's expected. Thank you for the clarification!
If you DO have Multus installed, can we test by EITHER:
- Using a cluster that does NOT have Multus installed.
- Changing the storage-network setting to match an actual NetworkAttachmentDefinition that works in the cluster instead of longhorn-system/notexist.
Yes, after changing the storage-network setting to an existing NetworkAttachmentDefinition, the instance managers work without problems.
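The FailedCreatePodSandBox message quoted above names the missing NetworkAttachmentDefinition, so this case can be told apart from other sandbox failures by matching on that text. This is a minimal sketch that assumes the wording of the Multus error stays as shown in this thread; the function name is hypothetical.

```python
import re

# Pattern based on the Multus error text quoted in this thread; other Multus
# versions may word the message differently.
_MISSING_NAD = re.compile(
    r"cannot find a network-attachment-definition \((?P<name>[^)]+)\) "
    r"in namespace \((?P<namespace>[^)]+)\)"
)


def missing_network_attachment(event_message):
    """Return (namespace, name) of the missing NetworkAttachmentDefinition, or None."""
    match = _MISSING_NAD.search(event_message)
    if match is None:
        return None
    return match["namespace"], match["name"]
```

A result that matches the storage-network setting value suggests the setting points at a NetworkAttachmentDefinition that does not exist, which is the expected failure discussed here rather than the pod-recreation bug.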
from longhorn.
Verified passed on v1.5.x-head (longhorn-manager 63ba2f7) following the test plan.
And Longhorn runs without problem on the storage network pipeline: https://ci.longhorn.io/job/private/job/longhorn-storage-network-test/15/
from longhorn.