
Comments (17)

wozniakjan commented on August 11, 2024

Looks like snapshot creation timed out. The default timeout is 1 minute, and if you have a good amount of data in that PV, it may well take longer than that. Snapshotting is implemented pretty naively as a tarball, and creating a tar.gz for a couple of gigabytes can take longer than one minute.

You can configure a higher timeout via the external-snapshotter parameters:
https://github.com/kubernetes-csi/external-snapshotter#important-optional-arguments-that-are-highly-recommended-to-be-used-1
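
For illustration, a minimal sketch of what raising the sidecar's --timeout could look like (the flag comes from the external-snapshotter docs linked above; the surrounding args and the 10m value are assumptions, not the project's exact manifest):

yaml

args:
  - "--csi-address=$(ADDRESS)"
  - "--timeout=10m"   # raised from the 1m default to cover larger PVs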


znaive commented on August 11, 2024

> Looks like snapshot creation timed out. The default timeout is 1 minute, and if you have a good amount of data in that PV, it may well take longer than that. Snapshotting is implemented pretty naively as a tarball, and creating a tar.gz for a couple of gigabytes can take longer than one minute.
>
> You can configure a higher timeout via the external-snapshotter parameters: https://github.com/kubernetes-csi/external-snapshotter#important-optional-arguments-that-are-highly-recommended-to-be-used-1

When I add the timeout to the YAML, I get an error saying there is no such flag:

yaml

# This YAML file shows how to deploy the snapshot controller

# The snapshot controller implements the control loop for CSI snapshot functionality.
# It should be installed as part of the base Kubernetes distribution in an appropriate
# namespace for components implementing base system functionality. For installing with
# Vanilla Kubernetes, kube-system makes sense for the namespace.

---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: snapshot-controller
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: snapshot-controller
  # the snapshot controller won't be marked as ready if the v1 CRDs are unavailable
  # in #504 the snapshot-controller will exit after around 7.5 seconds if it
  # can't find the v1 CRDs so this value should be greater than that
  minReadySeconds: 15
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: snapshot-controller
    spec:
      serviceAccountName: snapshot-controller
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      tolerations:
        - key: "node-role.kubernetes.io/master"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
        - key: "node-role.kubernetes.io/controlplane"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
        - key: "node-role.kubernetes.io/control-plane"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: snapshot-controller
          image: registry.k8s.io/sig-storage/snapshot-controller:v6.2.2
          args:
            - "--v=2"
            - "--leader-election=true"
            - "--leader-election-namespace=kube-system"
            - "--timeout=30m"
          resources:
            limits:
              memory: 300Mi
            requests:
              cpu: 10m
              memory: 20Mi

log

$ kubectl logs snapshot-controller-6fd7d5d77f-dfb8g -n kube-system
flag provided but not defined: -timeout
Usage of /snapshot-controller:
  -add_dir_header
        If true, adds the file directory to the header of the log messages
  -alsologtostderr
        log to standard error as well as files (no effect when -logtostderr=true)
  -enable-distributed-snapshotting
        Enables each node to handle snapshotting for the local volumes created on that node
  -http-endpoint string
        The TCP network address where the HTTP server for diagnostics, including metrics, will listen (example: :8080). The default is empty string, which means the server is disabled.
  -kube-api-burst int
        Burst to use while communicating with the kubernetes apiserver. Defaults to 10. (default 10)
  -kube-api-qps float
        QPS to use while communicating with the kubernetes apiserver. Defaults to 5.0. (default 5)
  -kubeconfig string
        Absolute path to the kubeconfig file. Required only when running out of cluster.
  -leader-election
        Enables leader election.
  -leader-election-lease-duration duration
        Duration, in seconds, that non-leader candidates will wait to force acquire leadership. Defaults to 15 seconds. (default 15s)
  -leader-election-namespace string
        The namespace where the leader election resource exists. Defaults to the pod namespace if not set.
  -leader-election-renew-deadline duration
        Duration, in seconds, that the acting leader will retry refreshing leadership before giving up. Defaults to 10 seconds. (default 10s)
  -leader-election-retry-period duration
        Duration, in seconds, the LeaderElector clients should wait between tries of actions. Defaults to 5 seconds. (default 5s)
  -log_backtrace_at value
        when logging hits line file:N, emit a stack trace
  -log_dir string
        If non-empty, write log files in this directory (no effect when -logtostderr=true)
  -log_file string
        If non-empty, use this log file (no effect when -logtostderr=true)
  -log_file_max_size uint
        Defines the maximum size a log file can grow to (no effect when -logtostderr=true). Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
  -logtostderr
        log to standard error instead of files (default true)
  -metrics-path /metrics
        The HTTP path where prometheus metrics will be exposed. Default is /metrics. (default "/metrics")
  -one_output
        If true, only write logs to their native severity level (vs also writing to each lower severity level; no effect when -logtostderr=true)
  -prevent-volume-mode-conversion
        Prevents an unauthorised user from modifying the volume mode when creating a PVC from an existing VolumeSnapshot.
  -resync-period duration
        Resync interval of the controller. (default 15m0s)
  -retry-crd-interval-max duration
        Maximum retry interval to wait for CRDs to appear. The default is 5 seconds. (default 5s)
  -retry-interval-max duration
        Maximum retry interval of failed volume snapshot creation or deletion. Default is 5 minutes. (default 5m0s)
  -retry-interval-start duration
        Initial retry interval of failed volume snapshot creation or deletion. It doubles with each failure, up to retry-interval-max. Default is 1 second. (default 1s)
  -skip_headers
        If true, avoid header prefixes in the log messages
  -skip_log_headers
        If true, avoid headers when opening log files (no effect when -logtostderr=true)
  -stderrthreshold value
        logs at or above this threshold go to stderr when writing to files and stderr (no effect when -logtostderr=true or -alsologtostderr=false) (default 2)
  -v value
        number for the log level verbosity
  -version
        Show version.
  -vmodule value
        comma-separated list of pattern=N settings for file-filtered logging
  -worker-threads int
        Number of worker threads. (default 10)


wozniakjan commented on August 11, 2024

The --timeout flag should be configured on the external-snapshotter container; the one you shared is the snapshot-controller, which indeed does not support --timeout as an argument.

args:
  - "--v=2"
  - "--leader-election=true"
  - "--leader-election-namespace={{ .Release.Namespace }}"


znaive commented on August 11, 2024

> The --timeout flag should be configured on the external-snapshotter container; the one you shared is the snapshot-controller, which indeed does not support --timeout as an argument.
>
> args:
>   - "--v=2"
>   - "--leader-election=true"
>   - "--leader-election-namespace={{ .Release.Namespace }}"

So how do I use the external-snapshotter? Do I need to deploy it separately? Thank you very much for answering me!


wozniakjan commented on August 11, 2024

If you are using Helm and running the latest released version, v4.4.0, then it should be enabled by default. This is the configuration knob.
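
For reference, a hedged sketch of setting it explicitly via Helm (the externalSnapshotter.enabled value name is an assumption here; check the chart's values.yaml for the exact knob):

$ helm upgrade --install csi-driver-nfs csi-driver-nfs/csi-driver-nfs \
    --namespace kube-system \
    --set externalSnapshotter.enabled=true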


wozniakjan commented on August 11, 2024

oh, my apologies, I navigated you poorly. It should actually be the snapshot-controller; I think you were setting it correctly. Let me take a closer look :)


znaive commented on August 11, 2024

> oh, my apologies, I navigated you poorly. It should actually be the snapshot-controller; I think you were setting it correctly. Let me take a closer look :)

Yes, please. Thank you so much.


wozniakjan commented on August 11, 2024

It should be this location in the csi-snapshotter; the flag is implemented there:

- name: csi-snapshotter
  image: "{{ .Values.image.csiSnapshotter.repository }}:{{ .Values.image.csiSnapshotter.tag }}"
  args:
    - "--v=2"
    - "--csi-address=$(ADDRESS)"
    - "--leader-election-namespace={{ .Release.Namespace }}"
    - "--leader-election"


znaive commented on August 11, 2024

> It should be this location in the csi-snapshotter; the flag is implemented there:
>
> - name: csi-snapshotter
>   image: "{{ .Values.image.csiSnapshotter.repository }}:{{ .Values.image.csiSnapshotter.tag }}"
>   args:
>     - "--v=2"
>     - "--csi-address=$(ADDRESS)"
>     - "--leader-election-namespace={{ .Release.Namespace }}"
>     - "--leader-election"

This is correct, thank you. I also wanted to ask: does this not support snapshot restore?

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc
spec:
  storageClassName: nfs-csi
  dataSource:
    name: test-nfs-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

The PVC stays Pending after it is created.


wozniakjan commented on August 11, 2024

Snapshot restore should be supported, and your manifest looks correct to me. Can you please share some csi-nfs-controller logs?


znaive commented on August 11, 2024

> Snapshot restore should be supported, and your manifest looks correct to me. Can you please share some csi-nfs-controller logs?

There is no log output:

kubectl logs csi-nfs-controller-6bc96c75d7-8wfq2 -n kube-system -c csi-snapshotter -f
[screenshot]

$ kubectl apply -f pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc
spec:
  storageClassName: nfs-csi
  dataSource:
    name: test-nfs-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

kubectl describe pvc restore-pvc
[screenshot]

kubectl get volumesnapshot
[screenshot]

kubectl logs csi-nfs-controller-6bc96c75d7-8wfq2 -n kube-system -c csi-provisioner -f
[screenshot]


wozniakjan commented on August 11, 2024

Looks like context deadline exceeded again. Can you perhaps try first with a smaller volume, just so we can feel confident it's a matter of size?


znaive commented on August 11, 2024

> Looks like context deadline exceeded again. Can you perhaps try first with a smaller volume, just so we can feel confident it's a matter of size?

When I use 200Mi, it works normally.

yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test
spec:
  storageClassName: nfs-csi
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 200Mi
---

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pvc-test
spec:
  volumeSnapshotClassName: csi-nfs-snapclass
  source:
    persistentVolumeClaimName: pvc-test

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc
spec:
  storageClassName: nfs-csi
  dataSource:
    name: pvc-test
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Mi

volumesnapshot

NAME                                                                READYTOUSE   SOURCEPVC          SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS       SNAPSHOTCONTENT                                    CREATIONTIME   AGE
pvc-test                                                            true         pvc-test                                   104           csi-nfs-snapclass   snapcontent-8d204215-3d13-4d6f-b119-6f3eb151ae77   11s            12s

pvc

pvc-test                 Bound    pvc-85ef92d4-b5ff-4b05-bc09-0d9adf284a04   200Mi          RWX            nfs-csi                        7m30s
restore-pvc              Bound    pvc-a6efbc8f-2f97-4fa0-9992-8a5bc30d954b   200Mi          RWO            nfs-csi                        4m26s

Is there any way to snapshot a large PVC?


wozniakjan commented on August 11, 2024

> When I use 200Mi, it works normally.

good, we at least established it's operational in your setup :)

> Is there any way to snapshot a large PVC?

You will need to find a sufficient value for --timeout and possibly bump the memory limits for the csi-nfs-controller. Creating a tarball out of 20Gi of content over NFS will depend on the network throughput, the number of files, and the NFS client's caching ability. I would personally go as far as removing the memory limit so the driver doesn't get OOM-killed, and then derive a reasonable memory limit from memory-consumption metrics. It could easily take a few GB of memory and as much as a few hours if the network connection is not very fast.

NFS doesn't really have a way to create native snapshots, which is why it's implemented here as compressed tarballs. For better performance, you may need to explore more feature-rich CSI drivers and set up different storage.
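
To illustrate the suggestion above, a minimal sketch of the controller's resources with the memory limit removed while real usage is measured (the container name nfs and the request values are assumptions based on a typical csi-nfs-controller deployment):

yaml

containers:
  - name: nfs
    resources:
      requests:
        cpu: 10m
        memory: 20Mi
      # memory limit intentionally omitted while measuring peak usage,
      # so the driver isn't OOM-killed mid-snapshot; reinstate a limit
      # once metrics show what large snapshots actually consume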


znaive commented on August 11, 2024

> > When I use 200Mi, it works normally.
>
> good, we at least established it's operational in your setup :)
>
> > Is there any way to snapshot a large PVC?
>
> You will need to find a sufficient value for --timeout and possibly bump the memory limits for the csi-nfs-controller. Creating a tarball out of 20Gi of content over NFS will depend on the network throughput, the number of files, and the NFS client's caching ability. I would personally go as far as removing the memory limit so the driver doesn't get OOM-killed, and then derive a reasonable memory limit from memory-consumption metrics. It could easily take a few GB of memory and as much as a few hours if the network connection is not very fast.
>
> NFS doesn't really have a way to create native snapshots, which is why it's implemented here as compressed tarballs. For better performance, you may need to explore more feature-rich CSI drivers and set up different storage.

I've removed the memory limit for all components and I get the same result; it still doesn't work.


wozniakjan commented on August 11, 2024

Per #509 (comment), it was context deadline exceeded, so the --timeout was not sufficient for your setup. Perhaps there is too much data, too many files, or the NFS server is too slow for the 30m you set earlier. You can try bumping it higher, or benchmark your NFS server with tools like fio to get a better idea of what to set --timeout to.
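
As a rough starting point, a hedged fio sketch for measuring sequential read throughput on the NFS mount (/mnt/nfs is a placeholder for wherever the share is mounted; the flags are standard fio options):

$ fio --name=seqread --directory=/mnt/nfs --rw=read --bs=1M --size=1g --numjobs=1 --direct=1

Dividing the PV's data size by the measured bandwidth gives a rough lower bound for --timeout; compression and many small files will push the real time higher.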


andyzhangx commented on August 11, 2024

@znaive please refer to the following PR to set a larger timeout value for the snapshot sidecar container instead of the snapshot controller:

#512

