Comments (3)
The issue can be reproduced as follows:
- Create an RWX volume.
- Create workload A using the volume. Keeping workload A running ensures that the share-manager pod keeps running.
- Repeatedly attach and detach workload B using the same volume. The memory usage of nfs-ganesha, observed with cat /proc/<PID of nfs-ganesha>/status | grep VmRSS, increases over time.
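The VmRSS check above can be scripted so that samples are easy to collect between attach/detach cycles. The following is a minimal sketch, not part of the original report: the function names vmrss_kb and sample_vmrss are hypothetical, and the longhorn-system namespace is an assumption based on a default Longhorn installation.

```shell
#!/bin/bash

# Print the VmRSS value (in kB) found in /proc/<PID>/status content read from stdin.
vmrss_kb() {
  awk '$1 == "VmRSS:" { print $2; exit }'
}

# Hypothetical wrapper: read the status file of a process inside a pod.
# Assumes the share-manager pod lives in the longhorn-system namespace.
# Arguments: <share-manager pod name> <PID of nfs-ganesha>
sample_vmrss() {
  local pod="$1" pid="$2"
  kubectl -n longhorn-system exec "$pod" -- cat "/proc/${pid}/status" | vmrss_kb
}
```

For example, sample_vmrss <share-manager pod name> <PID of nfs-ganesha> prints a single number that can be appended to a log after each cycle.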
from longhorn.
Pre Ready-For-Testing Checklist
- Where are the reproduction steps/test steps documented?
The reproduction steps/test steps are:
- Create a 3-node cluster.
- Create the first workload with an RWX volume using https://github.com/longhorn/longhorn/blob/master/examples/rwx/rwx-nginx-deployment.yaml
- Create a second workload with the same RWX volume.
- Scale the second workload down and up repeatedly, 100 times.
- Find the PID of nfs-ganesha in the share-manager pod with
ps aux
- Observe the VmRSS of nfs-ganesha in the share-manager pod with
cat /proc/<nfs-ganesha PID>/status | grep VmRSS
- VmRSS in LH v1.6.1 is significantly larger than the value after applying the fix.
- Does the PR include the explanation for the fix or the feature?
- Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc.) (including backport-needed/*)?
The PRs are at:
longhorn/nfs-ganesha#13
longhorn/longhorn-share-manager#204
- Which areas/issues might this PR have potential impacts on?
Area: RWX volume, memory leak, upstream
Issues
Verified on v1.6.x-head 20240507
- longhorn v1.6.x-head 46706be
- nfs-ganesha longhorn-ganesha-v5 longhorn/nfs-ganesha@996a59c
- longhorn-share-manager v1.6.x-head longhorn/longhorn-share-manager@510b21a
The test steps
#8394 (comment)
- Create first workload with a RWX volume by https://github.com/longhorn/longhorn/blob/master/examples/rwx/rwx-nginx-deployment.yaml
- Scale up the replicas to 3.
- Check if 3 workloads are in the "Running" state.
- Scale down the replicas to 1.
- Check if one workload is in the "Running" state.
Steps 2-5 can be repeated using the following shell script.
deployment_rwx_test.sh
#!/bin/bash
# Define the deployment name and the kubeconfig of the cluster under test
DEPLOYMENT_NAME="rwx-test"
KUBECONFIG="/home/ryao/Desktop/note/longhorn-tool/ryao-161.yaml"
for ((i=1; i<=100; i++)); do
  # Scale deployment up to 3 replicas
  kubectl --kubeconfig="$KUBECONFIG" scale deployment "$DEPLOYMENT_NAME" --replicas=3
  # Wait for the deployment to have 3 ready replicas
  until [[ "$(kubectl --kubeconfig="$KUBECONFIG" get deployment "$DEPLOYMENT_NAME" -o=jsonpath='{.status.readyReplicas}')" == "3" ]]; do
    ready_replicas=$(kubectl --kubeconfig="$KUBECONFIG" get deployment "$DEPLOYMENT_NAME" -o=jsonpath='{.status.readyReplicas}')
    echo "Iteration #$i: $DEPLOYMENT_NAME has ${ready_replicas:-0} ready replicas"
    sleep 1
  done
  # Check if all 3 pods are in the "Running" state
  while [[ "$(kubectl --kubeconfig="$KUBECONFIG" get pods -l=app="$DEPLOYMENT_NAME" -o=jsonpath='{.items[*].status.phase}')" != "Running Running Running" ]]; do
    echo "Not all pods are in the 'Running' state. Waiting..."
    sleep 5
  done
  # Scale deployment down to 1 replica
  kubectl --kubeconfig="$KUBECONFIG" scale deployment "$DEPLOYMENT_NAME" --replicas=1
  # Wait for the deployment to have 1 ready replica
  until [[ "$(kubectl --kubeconfig="$KUBECONFIG" get deployment "$DEPLOYMENT_NAME" -o=jsonpath='{.status.readyReplicas}')" == "1" ]]; do
    ready_replicas=$(kubectl --kubeconfig="$KUBECONFIG" get deployment "$DEPLOYMENT_NAME" -o=jsonpath='{.status.readyReplicas}')
    echo "Iteration #$i: $DEPLOYMENT_NAME has ${ready_replicas:-0} ready replicas"
    sleep 1
  done
  # Check that only one pod remains and that it is in the "Running" state
  while [[ "$(kubectl --kubeconfig="$KUBECONFIG" get pods -l=app="$DEPLOYMENT_NAME" -o=jsonpath='{.items[*].status.phase}')" != "Running" ]]; do
    echo "Not all pods are in the 'Running' state. Waiting..."
    sleep 5
  done
done
- Find the PID of nfs-ganesha in the share-manager pod by running ps aux
- Observe the VmRSS of nfs-ganesha in the share-manager pod by running cat /proc/<nfs-ganesha PID>/status | grep VmRSS
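The PID lookup in the step above can be done without reading the ps aux table by eye. This is a rough sketch under stated assumptions: ganesha_pid is a hypothetical helper, and the default pattern ganesha.nfsd assumes that the nfs-ganesha daemon appears under that name in the share-manager pod's process list; pass a different pattern if it does not.

```shell
#!/bin/bash

# Print the PID (second column of `ps aux` output) of the first process whose
# line contains the given pattern. Reads `ps aux` output from stdin and skips
# the header row. The default pattern is an assumption about the process name.
ganesha_pid() {
  local pattern="${1:-ganesha.nfsd}"
  awk -v pat="$pattern" 'NR > 1 && index($0, pat) { print $2; exit }'
}
```

A possible use is to pipe the pod's process list into it, for example kubectl exec <share-manager pod> -- ps aux | ganesha_pid, and feed the resulting PID into the VmRSS check.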
Result Passed
- We were also able to reproduce this issue on v1.6.1.
- After executing the script, the output for v1.6.1 is as follows:
Every 2.0s: cat /proc/29/status | grep VmRSS share-manager-pvc-119d403e-ae17-4f4f-aa7f-06e7bf40fca2: Tue May 7 09:54:38 2024
VmRSS: 47192 kB
For v1.6.x-head, the output is as follows:
Every 2.0s: cat /proc/29/status | grep VmRSS share-manager-pvc-f22c2fdf-330e-4c22-aea2-45a10c570cbf: Tue May 7 10:09:11 2024
VmRSS: 41604 kB
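To put the two readings side by side, the difference can be computed directly. The snippet below is only an illustrative calculation on the two single samples reported above (47192 kB on v1.6.1 and 41604 kB on v1.6.x-head); rss_reduction is a hypothetical helper name, and one sample per version is not a statistical comparison.

```shell
#!/bin/bash

# Print the absolute and relative difference between two VmRSS readings in kB.
# Arguments: <VmRSS before the fix> <VmRSS after the fix>
rss_reduction() {
  local before_kb="$1" after_kb="$2"
  awk -v b="$before_kb" -v a="$after_kb" \
    'BEGIN { printf "%d kB (%.1f%%)\n", b - a, (b - a) * 100 / b }'
}

# Values taken from the two watch outputs above.
rss_reduction 47192 41604   # prints: 5588 kB (11.8%)
```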