
Comments (5)

yangchiu commented on May 26, 2024

@ejweber I've tested it on v1.5.x-head, but the behavior is a little different from #8305 (comment). If the storage-network setting is set to longhorn-system/notexist, the instance-manager pods get stuck in the ContainerCreating state instead of reaching the Running state. Could you help confirm that this is the expected behavior? Thank you!

from longhorn.

longhorn-io-github-bot commented on May 26, 2024

Pre Ready-For-Testing Checklist

  • Where are the reproduction steps/test steps documented?

To reproduce in v1.5.x before fix:

  1. Ensure no volumes are attached.
  2. Set the storage-network setting to a random value (e.g. longhorn-system/notexist).
  3. Wait for the instance-manager pods to be recreated (as appropriate).
  4. Delete all longhorn-manager pods.
  5. Observe instance-manager pods being recreated again (for no reason, because the storage-network setting synced).
eweber@laptop:~/longhorn> kl delete pod -l app=longhorn-manager 
pod "longhorn-manager-cl472" deleted
pod "longhorn-manager-cnfkv" deleted
pod "longhorn-manager-pdtj2" deleted

eweber@laptop:~/longhorn> kl get pod | grep instance-manager
instance-manager-e-bbd0405d8fa87cc0209520b4c3262577   0/1     ContainerCreating   0          2s
instance-manager-e-c84e8856027c5474944ab4efd990f514   0/1     ContainerCreating   0          2s
instance-manager-e-e2c3e593061b7d2b3dbc3d36d2b2290a   0/1     Terminating         0          3s
instance-manager-r-bbd0405d8fa87cc0209520b4c3262577   0/1     Terminating         0          3s
instance-manager-r-c84e8856027c5474944ab4efd990f514   0/1     ContainerCreating   0          2s
instance-manager-r-e2c3e593061b7d2b3dbc3d36d2b2290a   0/1     Terminating         0          3s

To test the backported fix in v1.5.x:

  1. Repeat steps 1 to 4 above.
  2. Observe that the instance-manager pods are not recreated again.
eweber@laptop:~/longhorn> kl delete pod -l app=longhorn-manager 
pod "longhorn-manager-cl472" deleted
pod "longhorn-manager-cnfkv" deleted
pod "longhorn-manager-pdtj2" deleted

eweber@laptop:~/longhorn> kl get pod | grep instance-manager
instance-manager-e-4b443d2688949e932fe861a5e08f2a42   1/1     Running   0               5m23s
instance-manager-e-5b946f5536c20defcce3ba51560a1dee   1/1     Running   0               4m53s
instance-manager-e-5d61ec2e1803a0a52a407f466c402633   1/1     Running   0               5m5s
instance-manager-r-4b443d2688949e932fe861a5e08f2a42   1/1     Running   0               5m23s
instance-manager-r-5b946f5536c20defcce3ba51560a1dee   1/1     Running   0               4m54s
instance-manager-r-5d61ec2e1803a0a52a407f466c402633   1/1     Running   0               5m5s
  • Is there a workaround for the issue? If so, where is it documented?
    Once a volume attaches, pods are no longer erroneously recreated.

  • Does the PR include the explanation for the fix or the feature?

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at: longhorn/longhorn-manager#2720.
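The check in the test steps above (are the instance-manager pods still the old ones, or were they just recreated?) can be automated. The following is a minimal sketch, not part of Longhorn or its test suite: it assumes the default `kubectl get pod` column layout shown above, and the function names `age_to_seconds` and `recreated_pods` are hypothetical.

```python
import re
from typing import List

# Seconds per unit used in the AGE column of `kubectl get pod` (e.g. "5m23s").
_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def age_to_seconds(age: str) -> int:
    """Convert an AGE value such as '2s' or '5m23s' to seconds."""
    parts = re.findall(r"(\d+)([dhms])", age)
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def recreated_pods(listing: str, min_age_seconds: int) -> List[str]:
    """Return instance-manager pods that look freshly recreated.

    A pod is flagged when it is not Running or when it is younger than
    min_age_seconds (assumption: the caller passes roughly the time that
    has elapsed since the longhorn-manager pods were deleted).
    """
    flagged = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 5 or not fields[0].startswith("instance-manager"):
            continue  # skip the header row and unrelated pods
        name, status, age = fields[0], fields[2], fields[-1]
        if status != "Running" or age_to_seconds(age) < min_age_seconds:
            flagged.append(name)
    return flagged
```

Feeding it the output of `kl get pod` taken shortly after deleting the longhorn-manager pods should return an empty list once the fix is in place.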


ejweber commented on May 26, 2024

@yangchiu, this is not the behavior I observe in my cluster, but it may make sense.

My cluster does not have Multus or the NetworkAttachmentDefinition CRD installed. So, Longhorn sets the annotation on the instance-manager pod, but no component in the cluster actually attempts to set up a secondary network.

    k8s.v1.cni.cncf.io/networks: '[{"namespace": "longhorn-system", "name": "notexist",
      "interface": "lhnet1"}]'

This is fine for testing, as I only want to verify whether setting the Longhorn storage-network setting causes instance-manager pods to be continuously restarted.
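To make the relationship between the setting and the annotation concrete, here is a small sketch of how a `namespace/name` setting value maps onto the annotation value shown above. It is an illustration only, not Longhorn's actual implementation; the function name `storage_network_annotation` is hypothetical, and the `lhnet1` interface name is taken from the annotation in this comment.

```python
import json
from typing import Dict


def storage_network_annotation(setting: str, interface: str = "lhnet1") -> Dict[str, str]:
    """Build the Multus networks annotation for a 'namespace/name' setting value."""
    namespace, separator, name = setting.partition("/")
    if not separator or not namespace or not name:
        raise ValueError("expected '<namespace>/<name>', got %r" % setting)
    # The annotation value is a JSON list with one network selection element.
    value = json.dumps(
        [{"namespace": namespace, "name": name, "interface": interface}]
    )
    return {"k8s.v1.cni.cncf.io/networks": value}


annotation = storage_network_annotation("longhorn-system/notexist")
print(annotation["k8s.v1.cni.cncf.io/networks"])
```

Nothing in this mapping checks that the referenced NetworkAttachmentDefinition exists, which is consistent with the pod being created normally in a cluster where no component acts on the annotation.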

Do you have Multus installed in your test cluster? If so, I think it is likely that the container fails to start because there is no longhorn-system/notexist CRD in the cluster, leading to a failure in the network creation for the instance-manager pods.

If you DO have Multus installed, can we test by EITHER:

  1. Using a cluster that does NOT have Multus installed.
  2. Changing the storage-network setting to match an actual NetworkAttachmentDefinition that works in the cluster instead of longhorn-system/notexist.

If you DON'T have Multus installed, can you please send a support bundle for evaluation?


yangchiu commented on May 26, 2024

> @yangchiu, this is not the behavior I observe in my cluster, but it may make sense.
>
> My cluster does not have Multus or the NetworkAttachmentDefinition CRD installed. So, Longhorn sets the annotation on the instance-manager pod, but no component in the cluster actually attempts to set up a secondary network.
>
>     k8s.v1.cni.cncf.io/networks: '[{"namespace": "longhorn-system", "name": "notexist",
>       "interface": "lhnet1"}]'
>
> This is fine for testing, as I only want to verify whether setting the Longhorn storage-network setting causes instance-manager pods to be continuously restarted.
>
> Do you have Multus installed in your test cluster? If so, I think it is likely that the container fails to start because there is no longhorn-system/notexist CRD in the cluster, leading to a failure in the network creation for the instance-manager pods.

Yes, I tested with Multus installed in my cluster. Instance managers get stuck in ContainerCreating with this error message:

Warning  FailedCreatePodSandBox  4s    kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "06f9c589bde0ff43f6673441968b2dde448d465b40c59c93191e4f78c134c1eb": plugin type="multus" failed (add): Multus: [longhorn-system/instance-manager-d6ec2eb00e44dab155201d755760d661/5af79f42-d03a-4dfd-921c-33b766622adf]: error loading k8s delegates k8s args: TryLoadPodDelegates: error in getting k8s network for pod: GetNetworkDelegates: failed getting the delegate: getKubernetesDelegate: cannot find a network-attachment-definition (notexist) in namespace (longhorn-system): network-attachment-definitions.k8s.cni.cncf.io "notexist" not found

So it's expected. Thank you for the clarification!
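When triaging this kind of failure, the missing NetworkAttachmentDefinition can be pulled straight out of the kubelet event. The sketch below assumes the Multus message wording quoted above (other Multus versions may phrase it differently), and the function name `missing_network_attachment` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the fragment of the Multus error quoted in this comment.
_MISSING_NAD = re.compile(
    r"cannot find a network-attachment-definition \((?P<name>[^)]+)\) "
    r"in namespace \((?P<namespace>[^)]+)\)"
)


def missing_network_attachment(event_message: str) -> Optional[Tuple[str, str]]:
    """Return (namespace, name) of the missing definition, or None if absent."""
    match = _MISSING_NAD.search(event_message)
    if match is None:
        return None
    return match.group("namespace"), match.group("name")
```

If the returned pair matches the storage-network setting, the ContainerCreating state is explained by the setting pointing at a definition that does not exist.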

> If you DO have Multus installed, can we test by EITHER:
>
>   1. Using a cluster that does NOT have Multus installed.
>   2. Changing the storage-network setting to match an actual NetworkAttachmentDefinition that works in the cluster instead of longhorn-system/notexist.

Yes, after I changed the storage-network setting to an existing NetworkAttachmentDefinition, the instance managers work without problems.


yangchiu commented on May 26, 2024

Verified passed on v1.5.x-head (longhorn-manager 63ba2f7) following the test plan.

And Longhorn runs without problem on the storage network pipeline: https://ci.longhorn.io/job/private/job/longhorn-storage-network-test/15/
