[BACKPORT][v1.6.2][BUG] share-manager-pvc appears to be leaking memory (CLOSED)

github-actions commented on May 28, 2024

[BACKPORT][v1.6.2][BUG] share-manager-pvc appears to be leaking memory


Comments (3)

derekbit commented on May 28, 2024

The issue can be reproduced as follows:

Create a RWX volume.
Create a workload A using the volume. While workload A is running, the share-manager pod keeps running.
Repeatedly attach and detach workload B using the volume. The memory usage of nfs-ganesha (cat /proc/<PID of nfs-ganesha>/status | grep VmRSS) increases over time.
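
To make the growth visible over time rather than from a single reading, the VmRSS check can be sampled periodically. The following is a minimal sketch, not part of the original report: the helper names vmrss_kb and sample_vmrss are hypothetical, and it assumes it runs where /proc/<PID of nfs-ganesha>/status is readable (for example inside the share-manager pod).

```shell
#!/bin/bash

# Print the VmRSS value (in kB) from a /proc/<PID>/status-style file.
vmrss_kb() {
    local status_file="$1"
    awk '/^VmRSS:/ {print $2}' "$status_file"
}

# Sample VmRSS of a PID every INTERVAL seconds, COUNT times.
# Each output line is "<unix timestamp> <VmRSS in kB>".
sample_vmrss() {
    local pid="$1" interval="${2:-2}" count="${3:-5}"
    local i
    for ((i = 0; i < count; i++)); do
        printf '%s %s\n' "$(date +%s)" "$(vmrss_kb "/proc/${pid}/status")"
        sleep "$interval"
    done
}
```

For example, sample_vmrss <PID of nfs-ganesha> 2 30 would print 30 readings two seconds apart while workload B is being attached and detached.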


longhorn-io-github-bot commented on May 28, 2024

Pre Ready-For-Testing Checklist

Where are the reproduction steps/test steps documented?
The reproduction steps/test steps are:

Create a 3-node cluster.
Create the first workload with a RWX volume using https://github.com/longhorn/longhorn/blob/master/examples/rwx/rwx-nginx-deployment.yaml
Create a second workload with the same RWX volume.
Scale the second workload down and up repeatedly, 100 times.
Find the PID of nfs-ganesha in the share-manager pod with ps aux.
Observe the VmRSS of nfs-ganesha in the share-manager pod with cat /proc/<nfs-ganesha PID>/status | grep VmRSS
VmRSS in LH v1.6.1 is significantly larger than the value after applying the fix.
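
The PID lookup in the steps above can be scripted instead of read by eye. This is a rough sketch under stated assumptions: find_pid is a hypothetical helper, the default pattern "ganesha" assumes the nfs-ganesha command line contains that string, and the ps aux output is assumed to have the usual layout (PID in column 2, command from column 11 onward).

```shell
#!/bin/bash

# Read `ps aux` output on stdin and print the PID of the first process
# whose command (column 11 onward) matches the given pattern.
find_pid() {
    local pattern="${1:-ganesha}"
    awk -v pat="$pattern" '
        NR > 1 {
            cmd = ""
            for (i = 11; i <= NF; i++) cmd = cmd " " $i
            if (cmd ~ pat) { print $2; exit }
        }'
}

# Possible usage (pod name and namespace are placeholders, not from the issue):
#   PID=$(kubectl -n <namespace> exec <share-manager pod> -- ps aux | find_pid ganesha)
#   kubectl -n <namespace> exec <share-manager pod> -- cat "/proc/$PID/status" | grep VmRSS
```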

Does the PR include the explanation for the fix or the feature?
Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
The PRs are at

longhorn/nfs-ganesha#13
longhorn/longhorn-share-manager#204

Which areas/issues this PR might have potential impacts on?
Area: RWX volume, memory leak, upstream
Issues


roger-ryao commented on May 28, 2024

Verified on v1.6.x-head 20240507

longhorn v1.6.x-head 46706be
nfs-ganesha longhorn-ganesha-v5 longhorn/nfs-ganesha@996a59c
longhorn-share-manager v1.6.x-head longhorn/longhorn-share-manager@510b21a

The test steps
#8394 (comment)

1. Create the first workload with a RWX volume using https://github.com/longhorn/longhorn/blob/master/examples/rwx/rwx-nginx-deployment.yaml
2. Scale up the replicas to 3.
3. Check that all 3 workloads are in the "Running" state.
4. Scale down the replicas to 1.
5. Check that the one remaining workload is in the "Running" state.
Steps 2-5 can be repeated using the following shell script.

deployment_rwx_test.sh

#!/bin/bash

# Deployment to scale and the kubeconfig of the cluster under test
DEPLOYMENT_NAME="rwx-test"
KUBECONFIG_PATH="/home/ryao/Desktop/note/longhorn-tool/ryao-161.yaml"

for ((i=1; i<=100; i++)); do
    # Scale the deployment up to 3 replicas
    kubectl --kubeconfig="$KUBECONFIG_PATH" scale deployment "$DEPLOYMENT_NAME" --replicas=3

    # Wait for the deployment to have 3 ready replicas
    until [[ "$(kubectl --kubeconfig="$KUBECONFIG_PATH" get deployment "$DEPLOYMENT_NAME" -o=jsonpath='{.status.readyReplicas}')" == "3" ]]; do
        ready_replicas=$(kubectl --kubeconfig="$KUBECONFIG_PATH" get deployment "$DEPLOYMENT_NAME" -o=jsonpath='{.status.readyReplicas}')
        # readyReplicas is empty while no replica is ready, so default to 0
        echo "Iteration #$i: $DEPLOYMENT_NAME has ${ready_replicas:-0} ready replicas"
        sleep 1
    done

    # Wait until exactly 3 pods exist and all are in the "Running" phase
    while [[ "$(kubectl --kubeconfig="$KUBECONFIG_PATH" get pods -l=app="$DEPLOYMENT_NAME" -o=jsonpath='{.items[*].status.phase}')" != "Running Running Running" ]]; do
        echo "Not all pods are in the 'Running' state. Waiting..."
        sleep 5
    done

    # Scale the deployment down to 1 replica
    kubectl --kubeconfig="$KUBECONFIG_PATH" scale deployment "$DEPLOYMENT_NAME" --replicas=1

    # Wait for the deployment to have 1 ready replica
    until [[ "$(kubectl --kubeconfig="$KUBECONFIG_PATH" get deployment "$DEPLOYMENT_NAME" -o=jsonpath='{.status.readyReplicas}')" == "1" ]]; do
        ready_replicas=$(kubectl --kubeconfig="$KUBECONFIG_PATH" get deployment "$DEPLOYMENT_NAME" -o=jsonpath='{.status.readyReplicas}')
        echo "Iteration #$i: $DEPLOYMENT_NAME has ${ready_replicas:-0} ready replicas"
        sleep 1
    done

    # Wait until only one pod is left and it is in the "Running" phase
    while [[ "$(kubectl --kubeconfig="$KUBECONFIG_PATH" get pods -l=app="$DEPLOYMENT_NAME" -o=jsonpath='{.items[*].status.phase}')" != "Running" ]]; do
        echo "Not all pods are in the 'Running' state. Waiting..."
        sleep 5
    done
done

Find the PID of the nfs-ganesha in the share-manager pod by ps aux
Observe the VmRSS of nfs-ganesha in the share-manager pod by cat /proc/<nfs-ganesha PID>/status | grep VmRSS

Result: Passed

We were also able to reproduce this issue on v1.6.1.
After executing the script, the output for v1.6.1 is as follows:

Every 2.0s: cat /proc/29/status | grep VmRSS                     share-manager-pvc-119d403e-ae17-4f4f-aa7f-06e7bf40fca2: Tue May  7 09:54:38 2024

VmRSS:     47192 kB

For v1.6.x-head, the output is:

Every 2.0s: cat /proc/29/status | grep VmRSS                    share-manager-pvc-f22c2fdf-330e-4c22-aea2-45a10c570cbf: Tue May  7 10:09:11 2024

VmRSS:     41604 kB
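
To put the two readings above side by side, the difference can be computed directly. This is only an illustrative calculation on the two reported values (47192 kB on v1.6.1, 41604 kB on v1.6.x-head); rss_delta is a hypothetical helper name.

```shell
#!/bin/bash

# Print the absolute and relative drop between two VmRSS readings (in kB).
rss_delta() {
    local before_kb="$1" after_kb="$2"
    awk -v b="$before_kb" -v a="$after_kb" \
        'BEGIN { printf "%d kB (%.1f%%)\n", b - a, (b - a) * 100 / b }'
}

rss_delta 47192 41604   # prints "5588 kB (11.8%)"
```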
