Comments (10)

yangchiu commented on September 13, 2024

Could you help to create a separate issue for this

Created #8208 for it. Thank you!

from longhorn.

yangchiu commented on September 13, 2024

cc @c3y1huang

yangchiu commented on September 13, 2024

The same test case, another type of failure:

ApiException: (0)
Reason: Handshake status 500 Internal Server Error -+-+- 
{'content-length': '29', 'content-type': 'text/plain; charset=utf-8', 'date': 'Mon, 18 Mar 2024 12:42:25 GMT'} -+-+- 
b'container not found ("sleep")'

https://ci.longhorn.io/job/private/job/longhorn-e2e-test/427/

c3y1huang commented on September 13, 2024

While running the negative test case Reboot Node One By One While Workload Heavy Writing, the test gets stuck waiting for the pods of a deployment to become stable, even though the pod is already running:

[DEBUG] pod e2e-test-deployment-0-b957b9f54-4rt5p retry 59 != 60
Waiting for e2e-test-deployment-0 pods ['e2e-test-deployment-0-b957b9f54-4rt5p', 'e2e-test-deployment-0-b957b9f54-tnpzq'] stable, retry (59) ...
Waiting for e2e-test-deployment-0 pods ['e2e-test-deployment-0-b957b9f54-tnpzq'] stable, retry (60) ...
[DEBUG] pod e2e-test-deployment-0-b957b9f54-4rt5p retry 61 != 60

Somehow the retry count skipped 60, so a check for exact equality with the maximum never passes. To address this, we can check whether the retry count is greater than or equal to the maximum retry count.
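To illustrate the proposed change, here is a minimal sketch rather than the actual longhorn-tests code. The constant name and both helper functions are hypothetical; only the value 60 and the observed counts 59 and 61 come from the log above.

```python
from typing import Iterable

# Hypothetical name; the log above only shows the value 60.
WAIT_STABLE_RETRIES = 60


def reached_with_equality(counts: Iterable[int], limit: int = WAIT_STABLE_RETRIES) -> bool:
    """Fragile check: passes only if the counter is observed at exactly `limit`."""
    return any(count == limit for count in counts)


def reached_with_threshold(counts: Iterable[int], limit: int = WAIT_STABLE_RETRIES) -> bool:
    """Robust check: passes as soon as the counter meets or exceeds `limit`."""
    return any(count >= limit for count in counts)


# Counter values reported for pod e2e-test-deployment-0-b957b9f54-4rt5p: 59, then 61.
observed = [59, 61]
print(reached_with_equality(observed))   # False: 60 was never observed
print(reached_with_threshold(observed))  # True: 61 >= 60
```

With the threshold comparison, a counter that jumps past the limit still ends the wait instead of looping indefinitely.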

c3y1huang commented on September 13, 2024

The same test case, another type of failure:

ApiException: (0)
Reason: Handshake status 500 Internal Server Error -+-+- 
{'content-length': '29', 'content-type': 'text/plain; charset=utf-8', 'date': 'Mon, 18 Mar 2024 12:42:25 GMT'} -+-+- 
b'container not found ("sleep")'

https://ci.longhorn.io/job/private/job/longhorn-e2e-test/427/

I don't know what has triggered this.

  1. Volume attached at 12:31:39.
Mar 18 12:31:39 ip-10-0-2-28 k3s[1352]: I0318 12:31:39.291688    1352 operation_generator.go:633] "MountVolume.WaitForAttach succeeded for volume \"pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0\") pod \"e2e-test-deployment-2-7ddccb49f4-l47jl\" (UID: \"89d256e9-5f00-4ac5-a768-94c41fe9f365\") DevicePath \"csi-8c8c2b911395deaf705fc0a43968072e1ac253115a720b249e14f9c24e8755ed\"" pod="default/e2e-test-deployment-2-7ddccb49f4-l47jl"
  2. Deployment verified stable at 12:37:59 (shown as 20:37:59.360 in the report).
deployment . And Wait for deployment 0 pods stable
Start / End / Elapsed:	20240318 20:36:53.003 / 20240318 20:37:59.360 / 00:01:06.357
  3. The volume was unmounted.
Mar 18 12:41:49 ip-10-0-2-28 k3s[1352]: E0318 12:41:49.612041    1352 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0 podName: nodeName:}" failed. No retries permitted until 2024-03-18 12:43:51.612014825 +0000 UTC m=+986.522755894 (durationBeforeRetry 2m2s). Error: MountVolume.MountDevice failed for volume "pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0") pod "e2e-test-deployment-2-7ddccb49f4-l47jl" (UID: "89d256e9-5f00-4ac5-a768-94c41fe9f365") : rpc error: code = InvalidArgument desc = volume pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0 hasn't been attached yet
Mar 18 12:41:49 ip-10-0-2-28 k3s[1352]: E0318 12:41:49.611852    1352 csi_attacher.go:364] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = InvalidArgument desc = volume pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0 hasn't been attached yet
Mar 18 12:41:49 ip-10-0-2-28 k3s[1352]: I0318 12:41:49.593178    1352 operation_generator.go:633] "MountVolume.WaitForAttach succeeded for volume \"pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0\") pod \"e2e-test-deployment-2-7ddccb49f4-l47jl\" (UID: \"89d256e9-5f00-4ac5-a768-94c41fe9f365\") DevicePath \"csi-8c8c2b911395deaf705fc0a43968072e1ac253115a720b249e14f9c24e8755ed\"" pod="default/e2e-test-deployment-2-7ddccb49f4-l47jl"
Mar 18 12:41:49 ip-10-0-2-28 k3s[1352]: I0318 12:41:49.589369    1352 operation_generator.go:623] "MountVolume.WaitForAttach entering for volume \"pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0\") pod \"e2e-test-deployment-2-7ddccb49f4-l47jl\" (UID: \"89d256e9-5f00-4ac5-a768-94c41fe9f365\") DevicePath \"\"" pod="default/e2e-test-deployment-2-7ddccb49f4-l47jl"
Mar 18 12:40:54 ip-10-0-2-28 k3s[1352]: E0318 12:40:54.336071    1352 pod_workers.go:1300] "Error syncing pod, skipping" err="unmounted volumes=[pod-data], unattached volumes=[], failed to process volumes=[]: context deadline exceeded" pod="default/e2e-test-statefulset-2-0" podUID="51755125-cf24-4164-814c-ff779da0505c"
Mar 18 12:40:53 ip-10-0-2-28 k3s[1352]: E0318 12:40:53.336090    1352 pod_workers.go:1300] "Error syncing pod, skipping" err="unmounted volumes=[pod-data], unattached volumes=[], failed to process volumes=[]: context deadline exceeded" pod="default/e2e-test-deployment-2-7ddccb49f4-l47jl" podUID="89d256e9-5f00-4ac5-a768-94c41fe9f365"
Mar 18 12:39:47 ip-10-0-2-28 k3s[1352]: E0318 12:39:47.768586    1352 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-fb423261-1200-48a9-8077-e177d56746e2 podName: nodeName:}" failed. No retries permitted until 2024-03-18 12:41:49.768566096 +0000 UTC m=+864.679307165 (durationBeforeRetry 2m2s). Error: MountVolume.MountDevice failed for volume "pvc-fb423261-1200-48a9-8077-e177d56746e2" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-fb423261-1200-48a9-8077-e177d56746e2") pod "e2e-test-statefulset-2-0" (UID: "51755125-cf24-4164-814c-ff779da0505c") : rpc error: code = InvalidArgument desc = volume pvc-fb423261-1200-48a9-8077-e177d56746e2 hasn't been attached yet
Mar 18 12:39:47 ip-10-0-2-28 k3s[1352]: E0318 12:39:47.768380    1352 csi_attacher.go:364] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = InvalidArgument desc = volume pvc-fb423261-1200-48a9-8077-e177d56746e2 hasn't been attached yet
Mar 18 12:39:47 ip-10-0-2-28 k3s[1352]: I0318 12:39:47.745025    1352 operation_generator.go:633] "MountVolume.WaitForAttach succeeded for volume \"pvc-fb423261-1200-48a9-8077-e177d56746e2\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-fb423261-1200-48a9-8077-e177d56746e2\") pod \"e2e-test-statefulset-2-0\" (UID: \"51755125-cf24-4164-814c-ff779da0505c\") DevicePath \"csi-e3372656429ba5ddd2e47ad9bd23d10b758e716cac78ab1764808b39e824c032\"" pod="default/e2e-test-statefulset-2-0"
Mar 18 12:39:47 ip-10-0-2-28 k3s[1352]: I0318 12:39:47.741015    1352 operation_generator.go:623] "MountVolume.WaitForAttach entering for volume \"pvc-fb423261-1200-48a9-8077-e177d56746e2\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-fb423261-1200-48a9-8077-e177d56746e2\") pod \"e2e-test-statefulset-2-0\" (UID: \"51755125-cf24-4164-814c-ff779da0505c\") DevicePath \"\"" pod="default/e2e-test-statefulset-2-0"
Mar 18 12:39:47 ip-10-0-2-28 k3s[1352]: E0318 12:39:47.562797    1352 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0 podName: nodeName:}" failed. No retries permitted until 2024-03-18 12:41:49.562778041 +0000 UTC m=+864.473519756 (durationBeforeRetry 2m2s). Error: MountVolume.MountDevice failed for volume "pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0") pod "e2e-test-deployment-2-7ddccb49f4-l47jl" (UID: "89d256e9-5f00-4ac5-a768-94c41fe9f365") : rpc error: code = InvalidArgument desc = volume pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0 hasn't been attached yet
Mar 18 12:39:47 ip-10-0-2-28 k3s[1352]: E0318 12:39:47.562639    1352 csi_attacher.go:364] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = InvalidArgument desc = volume pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0 hasn't been attached yet
Mar 18 12:39:47 ip-10-0-2-28 k3s[1352]: I0318 12:39:47.543889    1352 operation_generator.go:633] "MountVolume.WaitForAttach succeeded for volume \"pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0\") pod \"e2e-test-deployment-2-7ddccb49f4-l47jl\" (UID: \"89d256e9-5f00-4ac5-a768-94c41fe9f365\") DevicePath \"csi-8c8c2b911395deaf705fc0a43968072e1ac253115a720b249e14f9c24e8755ed\"" pod="default/e2e-test-deployment-2-7ddccb49f4-l47jl"
Mar 18 12:39:47 ip-10-0-2-28 k3s[1352]: I0318 12:39:47.540153    1352 operation_generator.go:623] "MountVolume.WaitForAttach entering for volume \"pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0\") pod \"e2e-test-deployment-2-7ddccb49f4-l47jl\" (UID: \"89d256e9-5f00-4ac5-a768-94c41fe9f365\") DevicePath \"\"" pod="default/e2e-test-deployment-2-7ddccb49f4-l47jl"
Mar 18 12:38:36 ip-10-0-2-28 k3s[1352]: E0318 12:38:36.336254    1352 pod_workers.go:1300] "Error syncing pod, skipping" err="unmounted volumes=[pod-data], unattached volumes=[], failed to process volumes=[]: context deadline exceeded" pod="default/e2e-test-deployment-2-7ddccb49f4-l47jl" podUID="89d256e9-5f00-4ac5-a768-94c41fe9f365"
Mar 18 12:38:36 ip-10-0-2-28 k3s[1352]: E0318 12:38:36.336255    1352 pod_workers.go:1300] "Error syncing pod, skipping" err="unmounted volumes=[pod-data], unattached volumes=[], failed to process volumes=[]: context deadline exceeded" pod="default/e2e-test-statefulset-2-0" podUID="51755125-cf24-4164-814c-ff779da0505c"
Mar 18 12:37:45 ip-10-0-2-28 k3s[1352]: E0318 12:37:45.734339    1352 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-fb423261-1200-48a9-8077-e177d56746e2 podName: nodeName:}" failed. No retries permitted until 2024-03-18 12:39:47.734319757 +0000 UTC m=+742.645061587 (durationBeforeRetry 2m2s). Error: MountVolume.MountDevice failed for volume "pvc-fb423261-1200-48a9-8077-e177d56746e2" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-fb423261-1200-48a9-8077-e177d56746e2") pod "e2e-test-statefulset-2-0" (UID: "51755125-cf24-4164-814c-ff779da0505c") : rpc error: code = InvalidArgument desc = volume pvc-fb423261-1200-48a9-8077-e177d56746e2 hasn't been attached yet
Mar 18 12:37:45 ip-10-0-2-28 k3s[1352]: E0318 12:37:45.734181    1352 csi_attacher.go:364] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = InvalidArgument desc = volume pvc-fb423261-1200-48a9-8077-e177d56746e2 hasn't been attached yet
Mar 18 12:37:45 ip-10-0-2-28 k3s[1352]: I0318 12:37:45.717154    1352 operation_generator.go:633] "MountVolume.WaitForAttach succeeded for volume \"pvc-fb423261-1200-48a9-8077-e177d56746e2\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-fb423261-1200-48a9-8077-e177d56746e2\") pod \"e2e-test-statefulset-2-0\" (UID: \"51755125-cf24-4164-814c-ff779da0505c\") DevicePath \"csi-e3372656429ba5ddd2e47ad9bd23d10b758e716cac78ab1764808b39e824c032\"" pod="default/e2e-test-statefulset-2-0"
Mar 18 12:37:45 ip-10-0-2-28 k3s[1352]: I0318 12:37:45.713199    1352 operation_generator.go:623] "MountVolume.WaitForAttach entering for volume \"pvc-fb423261-1200-48a9-8077-e177d56746e2\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-fb423261-1200-48a9-8077-e177d56746e2\") pod \"e2e-test-statefulset-2-0\" (UID: \"51755125-cf24-4164-814c-ff779da0505c\") DevicePath \"\"" pod="default/e2e-test-statefulset-2-0"
Mar 18 12:37:45 ip-10-0-2-28 k3s[1352]: E0318 12:37:45.534198    1352 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0 podName: nodeName:}" failed. No retries permitted until 2024-03-18 12:39:47.534177625 +0000 UTC m=+742.444919394 (durationBeforeRetry 2m2s). Error: MountVolume.MountDevice failed for volume "pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0") pod "e2e-test-deployment-2-7ddccb49f4-l47jl" (UID: "89d256e9-5f00-4ac5-a768-94c41fe9f365") : rpc error: code = InvalidArgument desc = volume pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0 hasn't been attached yet
Mar 18 12:37:45 ip-10-0-2-28 k3s[1352]: E0318 12:37:45.534030    1352 csi_attacher.go:364] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = InvalidArgument desc = volume pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0 hasn't been attached yet
Mar 18 12:37:45 ip-10-0-2-28 k3s[1352]: I0318 12:37:45.515715    1352 operation_generator.go:633] "MountVolume.WaitForAttach succeeded for volume \"pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0\") pod \"e2e-test-deployment-2-7ddccb49f4-l47jl\" (UID: \"89d256e9-5f00-4ac5-a768-94c41fe9f365\") DevicePath \"csi-8c8c2b911395deaf705fc0a43968072e1ac253115a720b249e14f9c24e8755ed\"" pod="default/e2e-test-deployment-2-7ddccb49f4-l47jl"
Mar 18 12:37:45 ip-10-0-2-28 k3s[1352]: I0318 12:37:45.511729    1352 operation_generator.go:623] "MountVolume.WaitForAttach entering for volume \"pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-9b4adf15-a26f-42f9-b62b-f0d6b501f8e0\") pod \"e2e-test-deployment-2-7ddccb49f4-l47jl\" (UID: \"89d256e9-5f00-4ac5-a768-94c41fe9f365\") DevicePath \"\"" pod="default/e2e-test-deployment-2-7ddccb49f4-l47jl"
  4. Hit the container not found error at 12:42:25.
ApiException: (0)
Reason: Handshake status 500 Internal Server Error -+-+- {'content-length': '29', 'content-type': 'text/plain; charset=utf-8', 'date': 'Mon, 18 Mar 2024 12:42:25 GMT'} -+-+- b'container not found ("sleep")'

@yangchiu, any idea? Do we hit this often?

yangchiu commented on September 13, 2024

Do we hit this often?

I have never seen this in previous release testing phases. Since this is the first release testing after the refactoring, we need more time to determine how reproducible it is.

yangchiu commented on September 13, 2024

Another type of failure:

Got /data/random-data checksum = d41d8cd98f00b204e9800998ecf8427e

Expected checksum = dd: can't open '/data/random-data': Input/output error
d41d8cd98f00b204e9800998ecf8427e

https://ci.longhorn.io/job/private/job/longhorn-e2e-test/430/

Need to check whether it's a real issue or a test case defect.
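One observation that may help triage: the reported value d41d8cd98f00b204e9800998ecf8427e is the MD5 digest of empty input. The sketch below only demonstrates that fact. It assumes the test computes an MD5 checksum, which the thread does not state, and it does not show where the empty input came from.

```python
import hashlib

# Checksum reported for both the "Got" and "Expected" values above.
reported = "d41d8cd98f00b204e9800998ecf8427e"

# MD5 digest of zero bytes of input.
empty_md5 = hashlib.md5(b"").hexdigest()

print(empty_md5 == reported)  # True
```

If the assumption holds, both checksums may have been computed over no data at all, for example because the read of /data/random-data failed with the Input/output error, so matching values here would not prove the data is intact.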

c3y1huang commented on September 13, 2024

Another type of failure:

Got /data/random-data checksum = d41d8cd98f00b204e9800998ecf8427e

Expected checksum = dd: can't open '/data/random-data': Input/output error
d41d8cd98f00b204e9800998ecf8427e

https://ci.longhorn.io/job/private/job/longhorn-e2e-test/430/

Need to check whether it's a real issue or test case defect.

Could you help to create a separate issue for this? It is a different kind of failure, and folding newly discovered failures back into the original issue makes its complexity very hard to assess. For this issue, I will fix only the looping problem from the description. Thank you.

longhorn-io-github-bot commented on September 13, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: issue description

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at

  • Which areas/issues this PR might have potential impacts on?
    Area negative testing
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at longhorn/longhorn-tests#1821
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

c3y1huang commented on September 13, 2024

Closing the issue because the PR has been merged.
