
k8s-snapshots's Introduction

Interval-based Volume Snapshots and Expiry on Kubernetes

What you do: Create a custom SnapshotRule resource which defines your desired snapshot intervals. What I do: Create snapshots of your volumes, and expire old ones using a Grandfather-father-son backup scheme.

Supported Environments:

  • Google Compute Engine disks.
  • AWS EBS disks.
  • DigitalOcean volumes.

Want to help add support for other backends? It's pretty straightforward. Have a look at the API that backends need to implement.

Quickstart

A persistent volume claim:

cat <<EOF | kubectl apply -f -
apiVersion: "k8s-snapshots.elsdoerfer.com/v1"
kind: SnapshotRule
metadata:
  name: postgres
spec:
  deltas: P1D P30D
  persistentVolumeClaim: postgres-data
EOF

A specific AWS EBS volume:

cat <<EOF | kubectl apply -f -
apiVersion: "k8s-snapshots.elsdoerfer.com/v1"
kind: SnapshotRule
metadata:
  name: mysql
spec:
  deltas: P1D P30D
  backend: aws
  disk:
     region: eu-west-1
     volumeId: vol-0aa6f44aad0daf9f2
EOF

You can also use an annotation instead of the CRDs:

kubectl patch pv pvc-01f74065-8fe9-11e6-abdd-42010af00148 -p \
  '{"metadata": {"annotations": {"backup.kubernetes.io/deltas": "P1D P30D P360D"}}}'

Usage

How to enable backups

To back up a volume, you can create a SnapshotRule custom resource. See the SnapshotRule resources section further down.

Alternatively, you can add an annotation with the name backup.kubernetes.io/deltas to either your PersistentVolume or PersistentVolumeClaim resources.

Since PersistentVolumes are often created automatically for you by Kubernetes, you may want to annotate the volume claim in your resource definition file. Alternatively, you can run kubectl edit pv on a PersistentVolume created by Kubernetes and add the annotation there.
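For example, the claim in your resource definition might carry the annotation like this (the claim name, deltas, and size here are just placeholders):

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: postgres-data
  annotations:
    backup.kubernetes.io/deltas: "P1D P30D"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi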

The value of the annotation is a set of deltas that define how often a snapshot is created and how many snapshots should be kept. See the section below for more information on how deltas work.

In the end, your annotation may look like this:

backup.kubernetes.io/deltas: PT1H P2D P30D P180D

There is also the option of manually specifying the volume names to be backed up as options to the k8s-snapshots daemon. See below for more information.
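As a rough sketch, such a manual configuration is passed as environment variables on the k8s-snapshots container, following the VOLUMES / VOLUME_<NAME>_DELTAS / VOLUME_<NAME>_ZONE naming pattern; the volume name and zone below are placeholders:

env:
- name: VOLUMES
  value: "postgres-data"
- name: VOLUME_POSTGRES_DATA_DELTAS
  value: "P1D P30D"
- name: VOLUME_POSTGRES_DATA_ZONE
  value: "europe-west1-b"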

How the deltas work

The expiry logic of tarsnapper is used.

The generations are defined by a list of deltas formatted as ISO 8601 durations (this differs from tarsnapper). PT60S or PT1M means a minute, PT12H or P0.5D is half a day, P1W or P7D is a week. The number of backups in each generation is implied by its delta and the parent generation's delta.

For example, given the deltas PT1H P1D P7D, the first generation will consist of 24 backups each one hour older than the previous (or the closest approximation possible given the available backups), the second generation of 7 backups each one day older than the previous, and backups older than 7 days will be discarded for good.

If the daemon is not running for a while, it will still try to approximate your desired snapshot scheme as closely as possible.

The most recent backup is always kept.

The first delta is the backup interval.

Setup

k8s-snapshots needs access to your Kubernetes cluster resources (to read the desired snapshot configuration) and access to your cloud infrastructure (to make snapshots).

Depending on your environment, it may be able to configure itself. Or, you might need to provide some configuration options.

Use the example deployment file given below to start off.

cat <<EOF | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-snapshots
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: k8s-snapshots
  template:
    metadata:
      labels:
        app: k8s-snapshots
    spec:
      containers:
      - name: k8s-snapshots
        image: elsdoerfer/k8s-snapshots:latest
EOF

1. Based on your cluster.

See the docs/ folder for platform-specific instructions.

2. For Role-based Access Control (RBAC) enabled clusters

In Kubernetes clusters with RBAC, the k8s-snapshots pods need permission to watch and list persistentvolumes and persistentvolumeclaims. We provide a manifest in rbac.yaml that sets up a ServiceAccount with a minimal set of permissions.

kubectl apply -f manifests/rbac.yaml

Furthermore, under GKE, "Because of the way Container Engine checks permissions when you create a Role or ClusterRole, you must first create a RoleBinding that grants you all of the permissions included in the role you want to create."

If the above kubectl apply command produces an error about "attempt to grant extra privileges", the following will grant your user the necessary privileges first, so that you can then bind them to the service account:

  kubectl create clusterrolebinding your-user-cluster-admin-binding --clusterrole=cluster-admin --user=<your-google-account-email>

Finally, adjust the deployment by adding serviceAccountName: k8s-snapshots to the spec (else you'll end up using the "default" service account), as follows:

<snip>
    spec:
     serviceAccountName: k8s-snapshots
     containers:
      - name: k8s-snapshots
        image: elsdoerfer/k8s-snapshots:v2.0
</snip>

Further Configuration Options

Pinging a third party service

PING_URL: We'll send a GET request to this URL whenever a backup completes. This is useful for integrating with monitoring services like Cronitor or Dead Man's Snitch.

Make snapshot names more readable

If your persistent volumes are auto-provisioned by Kubernetes, you'll end up with snapshot names such as pv-pvc-01f74065-8fe9-11e6-abdd-42010af00148. If you want prettier names, set the environment variable USE_CLAIM_NAME=true. Instead of the auto-generated name of the persistent volume, k8s-snapshots will then use the name you gave to your PersistentVolumeClaim.
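Both of the options above are plain environment variables on the k8s-snapshots container; a minimal sketch (the ping URL is a placeholder):

env:
- name: USE_CLAIM_NAME
  value: "true"
- name: PING_URL
  value: "https://example.com/my-snapshot-ping"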

SnapshotRule resources

It's possible to ask k8s-snapshots to create snapshots of volumes for which no PersistentVolume object exists within the Kubernetes cluster. For example, you might have a volume at your Cloud provider that you use within Kubernetes by referencing it directly.

To do this, we use a custom Kubernetes resource, SnapshotRule.

First, you need to create this custom resource.

On Kubernetes 1.7 and higher:

cat <<EOF | kubectl create -f -
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: snapshotrules.k8s-snapshots.elsdoerfer.com
spec:
  group: k8s-snapshots.elsdoerfer.com
  version: v1
  scope: Namespaced
  names:
    plural: snapshotrules
    singular: snapshotrule
    kind: SnapshotRule
    shortNames:
    - sr
EOF

Or on Kubernetes 1.6 and lower:

cat <<EOF | kubectl create -f -
apiVersion: extensions/v1beta1
kind: ThirdPartyResource
metadata:
  name: snapshot-rule.k8s-snapshots.elsdoerfer.com
description: "Defines snapshot management rules for a disk."
versions:
- name: v1
EOF

You can then create SnapshotRule resources:

cat <<EOF | kubectl apply -f -
apiVersion: "k8s-snapshots.elsdoerfer.com/v1"
kind: SnapshotRule
metadata:
  name: mysql
spec:
  deltas: P1D P30D
  backend: aws
  disk:
     region: eu-west-1
     volumeId: vol-0aa6f44aad0daf9f2
EOF

This is an example for backing up an EBS disk on the Amazon cloud. The disk option requires different keys, depending on the backend. See the examples folder.

You may also point SnapshotRule resources to PersistentVolumes (or PersistentVolumeClaims). This is intended as an alternative to adding an annotation; it may be desirable for some to separate the snapshot functionality from the resource.

cat <<EOF | kubectl apply -f -
apiVersion: "k8s-snapshots.elsdoerfer.com/v1"
kind: SnapshotRule
metadata:
  name: mysql
spec:
  deltas: P1D P30D
  persistentVolumeClaim: datadir-mysql
EOF

Backing up the etcd volumes of a kops cluster

After setting up the custom resource definitions (see previous section), use snapshot rules as defined in the examples/backup-kops-etcd.yml file. Reference the volume ids of your etcd volumes.
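For illustration, such a rule has the same shape as the AWS example above; the name, deltas, and volume id below are placeholders for your own etcd volumes:

apiVersion: "k8s-snapshots.elsdoerfer.com/v1"
kind: SnapshotRule
metadata:
  name: etcd-main
spec:
  deltas: PT1H P1D P30D
  backend: aws
  disk:
     region: eu-west-1
     volumeId: vol-<your-etcd-volume-id>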

Other environment variables

LOG_LEVEL (default: INFO). Possible values: DEBUG, INFO, WARNING, ERROR.
JSON_LOG (default: False). Output the log messages as JSON objects for easier processing.
TZ (default: UTC). Used to change the timezone, e.g. TZ=America/Montreal.
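These are set as environment variables on the k8s-snapshots container. For example, to adjust them on the example deployment from the Setup section, something like the following should work:

kubectl set env -n kube-system deployment/k8s-snapshots LOG_LEVEL=DEBUG JSON_LOG=true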

FAQ

What if I manually create snapshots for the same volumes that k8s-snapshots manages?

Starting with v0.3, when k8s-snapshots decides when to create the next snapshot and which snapshots to delete, it only considers snapshots that it has labeled itself; snapshots created by other means are ignored.

k8s-snapshots's People

Contributors

adambom, akerouanton, bekriebel, corymsmith, dependabot[bot], dirrk, ekimekim, funkypenguin, jeremiahbowen, joar, jonathanrohland, joshua-vandenhoek, josmo, jpnauta, leg100, mikekap, miracle2k, mrtyler, shanegibbs, stromvirvel, wadahiro


k8s-snapshots's Issues

Investigate: Load more than the AWS snapshot limit

We are running the k8s-snapshots:v2.0 container in GKE.

After a while, we start getting a snapshot every minute and our container gets stuck in a crash loop. After deleting all the snapshots and starting again, it works as expected for a while and then starts crash looping again.

From what I can tell, the following happens:

  1. Once we reach a certain threshold of snapshots (turns out to be ~500), snapshots start happening more frequently than specified.
  2. The more snapshots we have, the worse it gets. We reach a point where we get a new snapshot every minute.
  3. We eventually reach our quota limit for snapshots.
  4. When k8s-snapshots creates a snapshot, the GCP operation is flagged as complete, but when the app requests the snapshot info from GCP, it doesn't exist (due to quota limits). This causes a 404 error and crashes the app.

After a bit of digging, I put together a fix that seems to work for us: #51

Feature Request: Trigger snapshots in parallel

I have added backup annotations to 5 PVCs in my cluster, and it seems that right now snapshots are triggered one at a time, and one must complete before the next one is started. Can we have all 5 start at the same time?

Also, as an aside, it would be nice if I could trigger snapshots at a certain time, like cron-based snapshotting, so that volumes in a distributed system would get snapshotted as close to the same time as possible.

GCR detects vulnerabilities (0 critical) in elsdoerfer/k8s-snapshots:v2.0

This is mostly an FYI for the maintainers and for users that the current upstream v2.0 image has some vulnerabilities according to the Google Container Registry vulnerability scanner:

Critical 0
High 176
Medium 457

Certainly, some of these vulnerabilities are not meaningful in the context of a Docker container running k8s-snapshots, e.g. an openssh vulnerability which "allows remote attackers to execute arbitrary local PKCS#11 modules by leveraging control over a forwarded agent-socket".

I rebuilt current master (b21015c) on top of the latest python:3.6 image. Doing this reduces the total to High 26, Medium 147.

I tried to build current master with python:3.6-alpine but it didn't work (I didn't dig any further):

ERROR: Could not build wheels for aiohttp which use PEP 517 and cannot be installed directly

I also tried with python:3.7-alpine (same build-time error as above) and python:3.7 (runtime error: TypeError: 'async for' received an object from __aiter__ that does not implement __anext__: generator).

Feel free to close this if it isn't helpful. My team thought we should pass on what we'd noticed and learned. Thanks for k8s-snapshots. :)

Authorization issues on AWS EKS

Summary

When deploying k8s-snapshots on an AWS EKS kubernetes cluster, it cannot create snapshots because of missing permissions in AWS.

I know that you're suggesting to run the controller on the master nodes, but since AWS EKS is a managed Kubernetes cluster, I don't have access to the master nodes for custom workloads.

Therefore I have some questions:

  • How does k8s-snapshots authenticate against AWS API? (Where does it get the credentials?)
  • Can I override the credentials somehow, as you can do it on Google Cloud?

Steps to reproduce

  1. Deploy a PVC with k8s-snapshot configuration
cat << EOF | k apply -f -
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: jrebel
  annotations:
    "backup.kubernetes.io/deltas": "PT1M PT5M PT1H"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ssd-general
EOF
  2. Deploy the k8s-snapshots deployment and rbac as stated in the README

  3. Wait for the k8s-snapshots pod to be created

Expected result

After one minute, in the AWS console a new snapshot for the given EBS is created.

Actual result

No EBS snapshot is created. k8s-snapshots pod status is first Error, then CrashLoopBackOff. Checking the pod's logs shows EC2ResponseError: 403 Forbidden, see:
https://gist.github.com/moepot/09ece52f86fe6724c63f2e17779ded2a

Where to find the created snapshot

Hi, sorry if the question is a dumb one; I am still new to Kubernetes. Where can I view the created snapshots? My first impression was that I could view them under the Snapshots menu in GCE (https://console.cloud.google.com/compute/snapshots), but I can't find any.

Also, could it be the case that my deployment failed to create a snapshot? I found this exception logged by the k8s-snapshots pod.

2017-12-30T17:29:06.303183Z watch-resources.worker.error   [k8s_snapshots.kube] message=watch-resources.worker.error resource_type_name=PersistentVolume severity=ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pykube/http.py", line 99, in raise_for_status
    resp.raise_for_status()
  File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 935, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://10.15.240.1:443/api/v1/persistentvolumes?watch=true

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/kube.py", line 181, in worker
    for event in sync_iterator:
  File "/usr/local/lib/python3.6/site-packages/pykube/query.py", line 156, in object_stream
    self.api.raise_for_status(r)
  File "/usr/local/lib/python3.6/site-packages/pykube/http.py", line 106, in raise_for_status
    raise HTTPError(resp.status_code, payload["message"])
pykube.exceptions.HTTPError: persistentvolumes is forbidden: User "system:serviceaccount:core:default" cannot watch persistentvolumes at the cluster scope: Unknown user "system:serviceaccount:core:default"
2017-12-30T17:29:08.295234Z watch-resources.worker.error   [k8s_snapshots.kube] message=watch-resources.worker.error resource_type_name=PersistentVolumeClaim severity=ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pykube/http.py", line 99, in raise_for_status
    resp.raise_for_status()
  File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 935, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://10.15.240.1:443/api/v1/persistentvolumeclaims?watch=true

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/kube.py", line 181, in worker
    for event in sync_iterator:
  File "/usr/local/lib/python3.6/site-packages/pykube/query.py", line 156, in object_stream
    self.api.raise_for_status(r)
  File "/usr/local/lib/python3.6/site-packages/pykube/http.py", line 106, in raise_for_status
    raise HTTPError(resp.status_code, payload["message"])
pykube.exceptions.HTTPError: persistentvolumeclaims is forbidden: User "system:serviceaccount:core:default" cannot watch persistentvolumeclaims at the cluster scope: Unknown user "system:serviceaccount:core:default"
2017-12-30T17:29:09.293575Z watch-resources.worker.error   [k8s_snapshots.kube] message=watch-resources.worker.error resource_type_name=SnapshotRule severity=ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pykube/http.py", line 99, in raise_for_status
    resp.raise_for_status()
  File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 935, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://10.15.240.1:443/apis/k8s-snapshots.elsdoerfer.com/v1/snapshotrules?watch=true

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/kube.py", line 181, in worker
    for event in sync_iterator:
  File "/usr/local/lib/python3.6/site-packages/pykube/query.py", line 156, in object_stream
    self.api.raise_for_status(r)
  File "/usr/local/lib/python3.6/site-packages/pykube/http.py", line 106, in raise_for_status
    raise HTTPError(resp.status_code, payload["message"])
pykube.exceptions.HTTPError: snapshotrules.k8s-snapshots.elsdoerfer.com is forbidden: User "system:serviceaccount:core:default" cannot watch snapshotrules.k8s-snapshots.elsdoerfer.com at the cluster scope: Unknown user "system:serviceaccount:core:default"

docs: CustomResourceDefinition is mandatory? + A tip for kops users

Hello,

I wanted to check my understanding on a couple things before I offer a PR.

CustomResourceDefinition is mandatory?

I have a k8s 1.8.0 cluster. Following the README I deployed k8s-snapshots (both v2.0 and dev), annotated a Persistent Volume, and hit this error:

2017-10-05T04:02:34.136420193Z requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://100.64.0.1:443/apis/k8s-snapshots.elsdoerfer.com/v1/snapshotrules?watch=true

k8s-snapshots.elsdoerfer.com was not in the output of curl https://100.64.0.1:443/apis/.
I deployed the CustomResourceDefinition from the "Manual snapshot rules" section later in the README and the error went away.

Is it expected that the CRD is mandatory, at least in k8s 1.7+? If so, I'll update the docs. If not, I can provide more detail about my setup if this is a bug worth investigating.

A tip for kops users

We use kops to manage our k8s cluster. k8s-snapshots didn't work out of the box due to a permissions issue. If you agree, I'd like to add a tip about this to the README for fellow kops users:

k8s-snapshots needs EBS and S3 permissions to take and save snapshots. Under the kops IAM Role scheme, only Masters have these permissions. The easiest solution is to run k8s-snapshots on Masters.

To run on a Master, add the following to the above manifest for the k8s-snapshots Deployment:

spec:
  ...
  template:
  ...
    spec:
      ...
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Equal"
        value: ""
        effect: "NoSchedule"
      nodeSelector:
        kubernetes.io/role: master

Thanks

k8s-snapshots is cool! :)

AWS Role

Hi,

As we're using kube2iam, I'm trying to determine which AWS policy would suffice for k8s-snapshots. Does anyone already have an example of this, by any chance?

--
Kind regards,
Tim

Describe restore procedure

k8s-snapshots works like a charm, thanks!

I am on GCP, and I am looking for a way to restore a disk and attach it to a pod.

Do you have a procedure that describes this?

Sometimes a duplicate snapshot is created

As best as I can tell from the log:

  • While a snapshot is running, the scheduling coroutine schedules it again (due to a duplicate update from the rule generator).
  • After the first snapshot is completed, the second one is started right away, without the scheduler running again, with the newly created one being considered in the plan this time around.

Error: No such backend

When defining snapshot rules for PVCs on GKE I'm getting rule.invalid errors. Relevant config and outputs are below. I'm happy to troubleshoot this, but I could use some guidance on where to begin.

Deployment

kubectl describe deployments --namespace=kube-system k8s-snapshot
Name:                   k8s-snapshot
Namespace:              kube-system
CreationTimestamp:      Wed, 10 Oct 2018 09:23:01 -0700
Labels:                 app=k8s-snapshot
                        chart=k8s-snapshot-0.1.0
                        heritage=Tiller
                        release=k8s-snapshot
Annotations:            deployment.kubernetes.io/revision=1
Selector:               app=k8s-snapshot,release=k8s-snapshot
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=k8s-snapshot
           release=k8s-snapshot
  Containers:
   k8s-snapshot:
    Image:  elsdoerfer/k8s-snapshots:v2.0
    Port:   <none>
    Environment:
      GCLOUD_JSON_KEYFILE_NAME:  /var/secrets/google-application-credentials/credentials.json
      USE_CLAIM_NAME:            true
      LOG_LEVEL:                 DEBUG
    Mounts:
      /var/secrets/google-application-credentials/ from google-application-credentials (ro)
  Volumes:
   google-application-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  google-application-credentials
    Optional:    false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   k8s-snapshot-558c758f86 (1/1 replicas created)
Events:
  Type    Reason             Age                From                   Message
  ----    ------             ----               ----                   -------
  Normal  ScalingReplicaSet  26m (x4 over 37m)  deployment-controller  Scaled down replica set k8s-snapshot-558c758f86 to 0
  Normal  ScalingReplicaSet  26m (x5 over 47m)  deployment-controller  Scaled up replica set k8s-snapshot-558c758f86 to 1

SnapshotRule Definition:

apiVersion: "k8s-snapshots.elsdoerfer.com/v1"
kind: SnapshotRule
metadata:
  name: gitlab-gitaly
spec:
  deltas: PT1H P1D P7D
  persistentVolumeClaim: repo-data-gitlab-gitaly-0

CRD definition

kubectl describe crd snapshotrules.k8s-snapshots.elsdoerfer.com
Name:         snapshotrules.k8s-snapshots.elsdoerfer.com
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  apiextensions.k8s.io/v1beta1
Kind:         CustomResourceDefinition
Metadata:
  Creation Timestamp:  2018-10-10T16:23:01Z
  Generation:          1
  Resource Version:    98171
  Self Link:           /apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/snapshotrules.k8s-snapshots.elsdoerfer.com
  UID:                 c1c7c450-cca8-11e8-8736-42010a8a0fd0
Spec:
  Group:  k8s-snapshots.elsdoerfer.com
  Names:
    Kind:       SnapshotRule
    List Kind:  SnapshotRuleList
    Plural:     snapshotrules
    Short Names:
      sr
    Singular:  snapshotrule
  Scope:       Namespaced
  Version:     v1
Status:
  Accepted Names:
    Kind:       SnapshotRule
    List Kind:  SnapshotRuleList
    Plural:     snapshotrules
    Short Names:
      sr
    Singular:  snapshotrule
  Conditions:
    Last Transition Time:  2018-10-10T16:23:01Z
    Message:               no conflicts found
    Reason:                NoConflicts
    Status:                True
    Type:                  NamesAccepted
    Last Transition Time:  2018-10-10T16:23:01Z
    Message:               the initial names have been accepted
    Reason:                InitialNamesAccepted
    Status:                True
    Type:                  Established
Events:                    <none>

PVC definition

kubectl describe pvc repo-data-gitlab-gitaly-0
Name:          repo-data-gitlab-gitaly-0
Namespace:     default
StorageClass:  standard
Status:        Bound
Volume:        pvc-754650ef-cc39-11e8-8736-42010a8a0fd0
Labels:        app=gitaly
               release=gitlab
Annotations:   pv.kubernetes.io/bind-completed=yes
               pv.kubernetes.io/bound-by-controller=yes
               volume.beta.kubernetes.io/storage-provisioner=kubernetes.io/gce-pd
Finalizers:    []
Capacity:      50Gi
Access Modes:  RWO
Events:        <none>

LOG output

[k8s_snapshots.core] backend=None message=No such backed: "None" resource=<SnapshotRule gitlab-gitaly> severity=ERROR structured_error=[{'type': 'ConfigurationError', 'message': 'No such backed: "None"', 'data': {'error': ModuleNotFoundError("No module named 'k8s_snapshots.backends.None'",)}, 'readable': ['Traceback (most recent call last):\n', '  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py", line 115, in rule_from_resource\n    backend = get_backend(backend_name)\n', '  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/backends/__init__.py", line 22, in get_backend\n    raise ConfigurationError(f\'No such backed: "{name}"\', error=e)\n', 'k8s_snapshots.errors.ConfigurationError: ConfigurationError: No such backed: "None" {\'error\': ModuleNotFoundError("No module named \'k8s_snapshots.backends.None\'",)}\n']}]

Update example in README

When I copy-pasted your example from the README, the tag v.0.3 was used, which doesn't exist on your Docker Hub repo. Consider changing that to v2.0 so users can get through the quickstart section more easily.

Feature Request: Move snapshot to a different region

It would be very convenient if it was possible to store snapshots in a different region, as it is a requirement for some users. It is possible to copy a snapshot to a different region using boto:

https://stackoverflow.com/questions/20631335/boto-copy-snapshot-to-another-region#20636112

I imagine the flow to be a bit like this:

  • Check in other region if a new snapshot is required
  • Create local snapshot
  • Copy snapshot to remote region
  • Remove local snapshot

Helm chart

I built a helm chart for this. Do you want it in this repo, or in the community helm/charts repo?

I can submit a PR against either. The latter is my preferred choice, as it's becoming kind of a canonical "please look for charts here" type thing.

doc: README timing example doesn't match description

Quoting the README:

kubectl patch pv pvc-01f74065-8fe9-11e6-abdd-42010af00148 -p \
  '{"metadata": {"annotations": {"backup.kubernetes.io/deltas": "PT1H P30D P360D"}}}'

k8s-snapshots will now run in your cluster, and per the deltas given, it will create a daily snapshot of the volume. It will keep 30 daily snapshots, and then for one year it will keep a monthly snapshot. If the daemon is not running for a while, it will still try to approximate your desired snapshot scheme as closely as possible.

But PT1H means hourly, not daily. The prose matches P1D P30D P360D.

Adding label in snapshot

Can we add a static label, or copy labels from the disk to the snapshot?
It would also be good to permit JSON templating with the PV/PVC description.

Service crashes when deltas include Month or Year

If deltas include M (month) or Y (year), the service will crash as it is unable to sort the values. For example, using deltas: P1D P1W P1M P1Y will result in an error.

This appears to be because isodate uses a datetime.timedelta object for hours, days, and weeks, but uses a Duration object for months and years: deltas=[datetime.timedelta(1), datetime.timedelta(7), isodate.duration.Duration(0, 0, 0, years=0, months=1), isodate.duration.Duration(0, 0, 0, years=1, months=0)]

A workaround is to just use days or weeks to represent months/years (P30D/P365D), but being able to use the Month or Year designations would make the rules more readable.

At minimum, catching these values and throwing an understandable error would be helpful.

Full error:

2018-02-06T17:10:34.555553Z Unhandled exception in main task [k8s_snapshots.__main__] loop=<_UnixSelectorEventLoop running=False closed=False debug=False> main_task=<Task finished coro=<daemon() done, defined at /usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py:444> exception=TypeError("'<' not supported between instances of 'Duration' and 'datetime.timedelta'",)> message=Unhandled exception in main task severity=ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/__main__.py", line 58, in main
    loop.run_until_complete(main_task)
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py", line 469, in daemon
    await asyncio.gather(*tasks)
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py", line 438, in backuper
    await make_backup(ctx, current_target_rule)
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 148, in make_backup
    await expire_snapshots(ctx, rule)
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 39, in expire_snapshots
    to_keep = expire(snapshots_with_date, rule.deltas)
  File "/usr/local/lib/python3.6/site-packages/tarsnapper/expire.py", line 57, in expire
    deltas.sort()
TypeError: '<' not supported between instances of 'Duration' and 'datetime.timedelta'
2018-02-06T17:10:34.565667Z Shutdown complete              [k8s_snapshots.__main__] message=Shutdown complete severity=INFO
Traceback (most recent call last):
  File "/usr/local/bin/k8s-snapshots", line 11, in <module>
    load_entry_point('k8s-snapshots==0.0.0', 'console_scripts', 'k8s-snapshots')()
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/__main__.py", line 58, in main
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py", line 469, in daemon
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py", line 438, in backuper
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 148, in make_backup
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 39, in expire_snapshots
  File "/usr/local/lib/python3.6/site-packages/tarsnapper/expire.py", line 57, in expire
    deltas.sort()
TypeError: '<' not supported between instances of 'Duration' and 'datetime.timedelta'

Feature request: Expose metrics in the Prometheus format

This looks like exactly what I was looking for; I got pointed to it by asyncho. I will spin it up in a cluster later today, but am curious: would you consider adding Prometheus metrics to the daemon? Prometheus provides a python library (https://github.com/prometheus/client_python), and Kubernetes supports the prometheus format quite well (https://github.com/google/cadvisor/blob/master/docs/application_metrics.md)

I could also look at doing a PR for this, if it's something that you'd accept / help me with (I don't really know python).

Alpine breaks `pendulum` package when not passing a `TZ` env var

Okay so this one is a bit my fault for not having tested the alpine image more extensively ; then again we should eventually build test into the package if we're using it in production ;)

Anyway...

It seems some of the dependencies are having problems with the alpine image ; notably we get the following error from pendulum:

2019-07-25T23:19:52.299533Z Unhandled exception in main task [k8s_snapshots.__main__] loop=<_UnixSelectorEventLoop running=False closed=False debug=False> main_task=<Task finished coro=<daemon() done, defined at /usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py:579> exception=RuntimeError('Can not find any timezone configuration',)> message=Unhandled exception in main task severity=ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/__main__.py", line 58, in main
    loop.run_until_complete(main_task)
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py", line 604, in daemon
    await asyncio.gather(*tasks)
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py", line 573, in backuper
    await make_backup(ctx, current_target_rule)
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 94, in make_backup
    time_start = pendulum.now()
  File "/usr/local/lib/python3.6/site-packages/pendulum/pendulum.py", line 286, in now
    return cls.instance(dt, tz)
  File "/usr/local/lib/python3.6/site-packages/pendulum/pendulum.py", line 221, in instance
    tzinfo=tz
  File "/usr/local/lib/python3.6/site-packages/pendulum/pendulum.py", line 157, in __init__
    self._tz = self._safe_create_datetime_zone(tzinfo)
  File "/usr/local/lib/python3.6/site-packages/pendulum/pendulum.py", line 66, in _safe_create_datetime_zone
    return cls._local_timezone()
  File "/usr/local/lib/python3.6/site-packages/pendulum/pendulum.py", line 94, in _local_timezone
    return local_timezone()
  File "/usr/local/lib/python3.6/site-packages/pendulum/tz/__init__.py", line 25, in local_timezone
    return LocalTimezone.get()
  File "/usr/local/lib/python3.6/site-packages/pendulum/tz/local_timezone.py", line 25, in get
    name = cls.get_local_tz_name()
  File "/usr/local/lib/python3.6/site-packages/pendulum/tz/local_timezone.py", line 60, in get_local_tz_name
    return getattr(cls, 'get_tz_name_for_{}'.format(os))()
  File "/usr/local/lib/python3.6/site-packages/pendulum/tz/local_timezone.py", line 169, in get_tz_name_for_unix
    raise RuntimeError('Can not find any timezone configuration')
RuntimeError: Can not find any timezone configuration
2019-07-25T23:19:55.373924Z Shutdown complete              [k8s_snapshots.__main__] message=Shutdown complete severity=INFO
Traceback (most recent call last):
  File "/usr/local/bin/k8s-snapshots", line 11, in <module>
    load_entry_point('k8s-snapshots==0.0.0', 'console_scripts', 'k8s-snapshots')()
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/__main__.py", line 58, in main
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py", line 604, in daemon
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py", line 573, in backuper
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 94, in make_backup
  File "/usr/local/lib/python3.6/site-packages/pendulum/pendulum.py", line 286, in now
    return cls.instance(dt, tz)
  File "/usr/local/lib/python3.6/site-packages/pendulum/pendulum.py", line 221, in instance
    tzinfo=tz
  File "/usr/local/lib/python3.6/site-packages/pendulum/pendulum.py", line 157, in __init__
    self._tz = self._safe_create_datetime_zone(tzinfo)
  File "/usr/local/lib/python3.6/site-packages/pendulum/pendulum.py", line 66, in _safe_create_datetime_zone
    return cls._local_timezone()
  File "/usr/local/lib/python3.6/site-packages/pendulum/pendulum.py", line 94, in _local_timezone
    return local_timezone()
  File "/usr/local/lib/python3.6/site-packages/pendulum/tz/__init__.py", line 25, in local_timezone
    return LocalTimezone.get()
  File "/usr/local/lib/python3.6/site-packages/pendulum/tz/local_timezone.py", line 25, in get
    name = cls.get_local_tz_name()
  File "/usr/local/lib/python3.6/site-packages/pendulum/tz/local_timezone.py", line 60, in get_local_tz_name
    return getattr(cls, 'get_tz_name_for_{}'.format(os))()
  File "/usr/local/lib/python3.6/site-packages/pendulum/tz/local_timezone.py", line 169, in get_tz_name_for_unix
    raise RuntimeError('Can not find any timezone configuration')
RuntimeError: Can not find any timezone configuration

I was able to fix the issue by providing a TZ=America/Montreal to my container but this is definitely not ideal.

I'm not sure what the best solution is here; we can either:

  1. drop alpine altogether;
  2. add the TZ env var to the readme / required vars;
  3. or default the image to TZ=UTC, leaving other people to change it if they want.

@miracle2k let me know what you think.

Snapshots not working with Kubernetes 1.12 & RBAC

This might not be the actual cause ; but I updated my Kubernetes cluster to version 1.12 with RBAC and I can't get k8s-snapshots to work.

Every time it runs I get the following:

2019-07-23T20:38:18.595427Z Unhandled exception in main task [k8s_snapshots.__main__] loop=<_UnixSelectorEventLoop running=False closed=False debug=False> main_task=<Task finished coro=<daemon() done, defined at /usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py:444> exception=<SnapshotCreateError: Error creating snapshot data={}>> message=Unhandled exception in main task severity=ERROR structured_error=[{'type': 'AttributeError', 'message': "'NoneType' object has no attribute 'create_snapshot'", 'readable': ['Traceback (most recent call last):\n', '  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 99, in make_backup\n    snapshot_description=serialize.dumps(rule),\n', '  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 171, in create_snapshot\n    lambda: backend.create_snapshot(\n', '  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/asyncutils.py", line 12, in run_in_executor\n    return await asyncio.get_event_loop().run_in_executor(None, func)\n', '  File "/usr/local/lib/python3.6/concurrent/futures/thread.py", line 55, in run\n    result = self.fn(*self.args, **self.kwargs)\n', '  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 175, in <lambda>\n    snapshot_description\n', '  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/backends/aws.py", line 102, in create_snapshot\n    snapshot = connection.create_snapshot(\n', "AttributeError: 'NoneType' object has no attribute 'create_snapshot'\n"]}, {'type': 'SnapshotCreateError', 'message': 'Error creating snapshot', 'data': {}, 'readable': ['Traceback (most recent call last):\n', '  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/__main__.py", line 58, in main\n    loop.run_until_complete(main_task)\n', '  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete\n    return future.result()\n', '  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py", line 469, in daemon\n    await asyncio.gather(*tasks)\n', '  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py", line 438, in backuper\n    await make_backup(ctx, current_target_rule)\n', '  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 123, in make_backup\n    ) from exc\n', 'k8s_snapshots.errors.SnapshotCreateError: SnapshotCreateError: Error creating snapshot {}\n']}]
2019-07-23T20:38:25.505967Z Shutdown complete              [k8s_snapshots.__main__] message=Shutdown complete severity=INFO
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 99, in make_backup
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 171, in create_snapshot
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/asyncutils.py", line 12, in run_in_executor
  File "/usr/local/lib/python3.6/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 175, in <lambda>
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/backends/aws.py", line 102, in create_snapshot
AttributeError: 'NoneType' object has no attribute 'create_snapshot'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/k8s-snapshots", line 11, in <module>
    load_entry_point('k8s-snapshots==0.0.0', 'console_scripts', 'k8s-snapshots')()
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/__main__.py", line 58, in main
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py", line 469, in daemon
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py", line 438, in backuper
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 123, in make_backup
k8s_snapshots.errors.SnapshotCreateError: SnapshotCreateError: Error creating snapshot {}

I've tracked the error to a boto3 connection error, but I can't understand why boto3.client('ec2', region_name=region) would be returning None

Any idea what might be happening here?

Quickstart doesn't work

I'm on Google Cloud Platform Kubernetes Engine running master version 1.11.6-gke.3. This is what I get when I try quickstart:

$ kubectl logs -n kube-system -f k8s-snapshots-5bb755c6cb-bpb6n
2019-02-02T23:52:05.066886Z rule.heartbeat                 [k8s_snapshots.core] message=rule.heartbeat rules=None severity=INFO
2019-02-02T23:52:05.071043Z kube-config.from-service-account [k8s_snapshots.context] message=kube-config.from-service-account severity=INFO
2019-02-02T23:52:05.122363Z watch-resources.worker.error   [k8s_snapshots.kube] message=watch-resources.worker.error resource_type_name=PersistentVolume severity=ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pykube/http.py", line 99, in raise_for_status
    resp.raise_for_status()
  File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 935, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://10.0.0.1:443/api/v1/persistentvolumes?watch=true

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/kube.py", line 181, in worker
    for event in sync_iterator:
  File "/usr/local/lib/python3.6/site-packages/pykube/query.py", line 156, in object_stream
    self.api.raise_for_status(r)
  File "/usr/local/lib/python3.6/site-packages/pykube/http.py", line 106, in raise_for_status
    raise HTTPError(resp.status_code, payload["message"])
pykube.exceptions.HTTPError: persistentvolumes is forbidden: User "system:serviceaccount:kube-system:default" cannot watch persistentvolumes at the cluster scope
2019-02-02T23:52:07.105507Z watch-resources.worker.error   [k8s_snapshots.kube] message=watch-resources.worker.error resource_type_name=PersistentVolumeClaim severity=ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pykube/http.py", line 99, in raise_for_status
    resp.raise_for_status()
  File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 935, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://10.0.0.1:443/api/v1/persistentvolumeclaims?watch=true

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/kube.py", line 181, in worker
    for event in sync_iterator:
  File "/usr/local/lib/python3.6/site-packages/pykube/query.py", line 156, in object_stream
    self.api.raise_for_status(r)
  File "/usr/local/lib/python3.6/site-packages/pykube/http.py", line 106, in raise_for_status
    raise HTTPError(resp.status_code, payload["message"])
pykube.exceptions.HTTPError: persistentvolumeclaims is forbidden: User "system:serviceaccount:kube-system:default" cannot watch persistentvolumeclaims at the cluster scope
2019-02-02T23:52:08.102549Z watch-resources.worker.error   [k8s_snapshots.kube] message=watch-resources.worker.error resource_type_name=SnapshotRule severity=ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pykube/http.py", line 99, in raise_for_status
    resp.raise_for_status()
  File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 935, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://10.0.0.1:443/apis/k8s-snapshots.elsdoerfer.com/v1/snapshotrules?watch=true

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/kube.py", line 181, in worker
    for event in sync_iterator:
  File "/usr/local/lib/python3.6/site-packages/pykube/query.py", line 156, in object_stream
    self.api.raise_for_status(r)
  File "/usr/local/lib/python3.6/site-packages/pykube/http.py", line 106, in raise_for_status
    raise HTTPError(resp.status_code, payload["message"])
pykube.exceptions.HTTPError: snapshotrules.k8s-snapshots.elsdoerfer.com is forbidden: User "system:serviceaccount:kube-system:default" cannot watch snapshotrules.k8s-snapshots.elsdoerfer.com at the cluster scope

Expected behavior and exclusion of PersistentVolumes

As I re-read the documentation and try this out I just wanted to nail down what the expected behavior of PV snapshots should be.

Right now I see the following:

  • PV's without an annotation will have an initial snapshot created, but no additional snapshots and therefore no aging/expiration
  • PV's with a defined annotation will create and expire subsequent snapshots based on the delta rule

Pretty simple, but I'm looking for clarification, because my initial impression after reading the documentation is that the k8s-snapshots pod would only create snapshots for PV's that do have the annotation with a defined delta. If I'm wrong and what I described above is the correct behavior then I feel like the documentation should be clearer.

Also, is there any way to exclude PV's from having snapshots created altogether?

Question about PING_URL Functionality

Hi,

First of all, this project is great. Nice and simple and seems to work without issue.

The PING_URL seems to work fine, but I was thinking about how it works and it raised an eyebrow. If there's only one PING_URL for the entire deployment, then it will ping for every single snapshot that is taken. This could mean that a single PV/PVC was never backed up, but your watcher, like deadmanssnitch would never know this, because it would be receiving pings from other snapshots.

Am I missing something, or is this the case?

Regards,
Andrew

AWS: empty snapshot names with USE_CLAIM_NAME=true

I've been running k8s-snapshots for a few months in a kops-managed Kubernetes cluster on AWS.

I pulled the latest k8s-snapshots Docker image. I believe I went from the 2018-02-09 image c5da4196dfa7 to the 2018-02-18 image d4c54dea2402 (current latest build).

With the old version and USE_CLAIM_NAME=true, my Snapshot names looked as expected. With the new version, Snapshot names are empty. I believe the k8s-snapshots version change is the only thing that changed in my cluster between the desired behavior and the buggy behavior.

If I have it right, the two changes merged between the old version and the new version are #51 --
which looks to be Google Cloud-only -- and #50. @wadahiro do you think this could be related to your boto3 changes?

TypeError: object of type 'Rule' has no len()

Hi,

Using the docker image from https://hub.docker.com/r/elsdoerfer/k8s-snapshots/, my container no longer starts. It now shows the error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/app/k8s_snapshots/__main__.py", line 101, in <module>
    sys.exit(main() or 0)
  File "/app/k8s_snapshots/__main__.py", line 15, in main
    config = k8s_snapshots.config.from_environ()
  File "/app/k8s_snapshots/config.py", line 128, in from_environ
    config.update(read_volume_config())
  File "/app/k8s_snapshots/config.py", line 181, in read_volume_config
    config['rules'] = list(filter(bool, map(read_volume, volumes)))
  File "/app/k8s_snapshots/config.py", line 175, in read_volume
    _log.info(events.Rule.ADDED_FROM_CONFIG, rule=rule)
  File "/usr/local/lib/python3.6/site-packages/structlog/_base.py", line 176, in _proxy_to_logger
    args, kw = self._process_event(method_name, event, event_kw)
  File "/usr/local/lib/python3.6/site-packages/structlog/_base.py", line 136, in _process_event
    event_dict = proc(self._logger, method_name, event_dict)
  File "/usr/local/lib/python3.6/site-packages/structlog/dev.py", line 181, in __call__
    event = _pad(event, self._pad_event) + self._styles.reset + " "
  File "/usr/local/lib/python3.6/site-packages/structlog/dev.py", line 36, in _pad
    missing = l - len(s)
TypeError: object of type 'Rule' has no len()

Any idea what this error is caused by? Has the syntax for anything changed? Just to give an example, here is what my config looks like:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: k8s-snapshots
  labels:
    app: k8s-snapshots
spec:
  replicas: 1
  selector:
    matchLabels:
      app: k8s-snapshots
  template:
    metadata:
      labels:
        app: k8s-snapshots
    spec:
      containers:
        - image: elsdoerfer/k8s-snapshots
          name: k8s-snapshots
          env:
          - name: GCLOUD_PROJECT
            value: <hidden>
          - name: VOLUMES
            value: "wordpress"
          - name: VOLUME_WORDPRESS_DELTAS
            value: "6h 1d 30d 180d"
          - name: VOLUME_WORDPRESS_ZONE
            value: "asia-east1-a"
...<more here>...
          - name: GCLOUD_JSON_KEYFILE_STRING
            valueFrom:
              secretKeyRef:
                name: snapshots-service-account
                key: snapshots-service-account.json

USE_CLAIM_NAME does not work as expected in AWS

I have set the env variable for the container in the following way:

          env:
             - name: USE_CLAIM_NAME
               value: "true"

My EBS snapshots still have the name of the PV instead of the PVC, and I have not found any messages regarding this in the log. What is the expected behavior? Maybe I set it wrong?

Super strange issue, async-related?

@joar Wonder if you have any insight here. There is a bug in the current code for me. If I kubectl edit a PersistentVolume, k8s-snapshots will not recognize that change. It will understand if I edit a PersistentVolumeClaim.

So I was looking into this, and I noticed that for both watch threads, it is actually querying the same resource, namely the last one that started watching (here, I turned it around, which is why it's watching the volumes, not the claims):

2017-08-12T13:51:46.197303Z foreign                        [urllib3.connectionpool] message=https://130.211.108.245:443 "GET /api/v1/persistentvolumes?watch=true HTTP/1.1" 200 None severity=DEBUG
2017-08-12T13:51:46.202315Z foreign                        [urllib3.connectionpool] message=https://130.211.108.245:443 "GET /api/v1/persistentvolumes?watch=true HTTP/1.1" 200 None severity=DEBUG

Correspondingly, if I log the actual events we receive, they are always PersistentVolume updates, even if it's a PersistentVolumeClaim watch thread:

got an event WatchEvent(type='ADDED', object=<PersistentVolume pvc-79960dea-d1c1-11e6-b958-42010af000bd>) <class 'pykube.objects.PersistentVolumeClaim'>
got an event WatchEvent(type='ADDED', object=<PersistentVolume pvc-77403b9f-d5e1-11e6-b958-42010af000bd>) <class 'pykube.objects.PersistentVolume'>
got an event WatchEvent(type='ADDED', object=<PersistentVolume pvc-e7242b6a-08da-11e7-924e-42010af000dd>) <class 'pykube.objects.PersistentVolumeClaim'>

Now the thing that kind of blows my mind:

If I change Kubernetes.watch from:

def watch(
            self,
            resource_type: Type[Resource],
    ) -> Iterable[_WatchEvent]:
        """
        Sync wrapper for :any:`pykube.query.Query().watch().object_stream()`
        """
        return resource_type.objects(self.client_factory()).watch().object_stream()

to

def watch(
            self,
            resource_type: Type[Resource],
    ) -> Iterable[_WatchEvent]:
        """
        Sync wrapper for :any:`pykube.query.Query().watch().object_stream()`
        """
        api = self.client_factory()
        return resource_type.objects(api).watch().object_stream()

It works. EVERY TIME. Recording to prove it: http://recordit.co/HIBKNSNlXS

How can that be?

Any reason I might be limited to 100 snapshots on GKE?

Hi all,

Having successfully deployed k8s-snapshot into my GKE environment yesterday, I started snapshotting all my PVs (I have about 30), with PT1H, P1D, P7D.

I've found, though, that there are exactly 100 snapshots created, and no more (and some that I'd expect to see are missing). This doesn't seem to be a GKE limitation, since I've been able to create a manual snapshot (number 101), and my project quota allows for up to 1000 snapshots.

Is there any default limitation in k8s-snapshots which prevents creation of more than 100 snapshots?

Thanks!
D

On GCP new pvc's are not picked up

I am on GCP running in an GKE cluster.

If I set an annotation backup.kubernetes.io/deltas: PT1H P2D P30D P180D on a PVC k8s-snapshots does not seem to notice and ignores it till it gets restarted.

server version: v1.7.8-gke.0
manifest snip:

      containers:
      - env:
        - name: LOG_LEVEL
          value: INFO
        - name: USE_CLAIM_NAME
          value: "true"
        image: elsdoerfer/k8s-snapshots:v2.0

Feature request: Option for last generation to never expire

I would like a way to specify that snapshots in the last generation should never be deleted, but kept indefinitely. Currently I can only do this via a workaround like giving P9999Y as the last interval, which is hacky. Ideally I'd like to be able to specify a special-case final interval as forever, which would indicate the final generation should have no end point.

Snapshotting stopped working: googleapiclient.errors.HttpError

Hi,

Thanks for this great project. All the annotated pvc's in my project stopped getting snapshots since May 10, and the k8s-snapshots instance is giving the error below. Any idea why or how this can happen?

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 112, in make_backup
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 215, in poll_for_status
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 304, in get_snapshot_status
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/asyncutils.py", line 12, in run_in_executor
  File "/usr/local/lib/python3.6/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/snapshot.py", line 304, in <lambda>
  File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/backends/google.py", line 250, in get_snapshot_status
  File "/usr/local/lib/python3.6/site-packages/oauth2client/_helpers.py", line 133, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/googleapiclient/http.py", line 840, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 404 when requesting https://www.googleapis.com/compute/v1/projects/PROJECT_NAME/global/snapshots/SNAPSHOT_NAME?alt=json returned "The resource 'projects/PROJECT_NAME/global/snapshots/SNAPSHOT_NAME' was not found">

Thanks
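
Not the project's actual code, but a hedged sketch of how a Google Cloud snapshot status poll could be made resilient to a snapshot disappearing server-side (compute is assumed to be a googleapiclient compute/v1 client):

from googleapiclient.errors import HttpError

def get_snapshot_status_or_none(compute, project, snapshot_name):
    # Poll the snapshot, but treat a 404 as "snapshot is gone" instead of
    # letting the HttpError bubble up and stall the backup loop.
    try:
        result = compute.snapshots().get(
            project=project, snapshot=snapshot_name
        ).execute()
        return result.get('status')
    except HttpError as exc:
        if exc.resp.status == 404:
            return None  # deleted out of band, e.g. manually in the console
        raise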

Snapshots not creating

Hi,

I'm trying v1.0.1, using manual config, but no snapshots are created. Here's my latest config (testing snapshotting the only disk I have thus far):

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: k8s-snapshots
  labels:
    app: k8s-snapshots
spec:
  replicas: 1
  selector:
    matchLabels:
      app: k8s-snapshots
  template:
    metadata:
      labels:
        app: k8s-snapshots
    spec:
      containers:
        - image: elsdoerfer/k8s-snapshots:v1.0.1
          name: k8s-snapshots
          env:
          - name: GCLOUD_PROJECT
            value: manse-cloud
          - name: VOLUMES
            value: "kube-cert-manager"
          - name: VOLUME_KUBE_CERT_MANAGER_DELTAS
            value: "PT1H P1D P7D"
          - name: VOLUME_KUBE_CERT_MANAGER_ZONE
            value: "australia-southeast1-c"
          - name: LOG_LEVEL
            value: "DEBUG"
          - name: GCLOUD_JSON_KEYFILE_STRING
            valueFrom:
              secretKeyRef:
                name: snapshots-service-account
                key: snapshots-service-account.json

Here's the logs after an hour or two:

2017-08-21T00:12:42.233421Z rule.from-config               [k8s_snapshots.config] deltas_str=PT1H P1D P7D message=rule.from-config rule=Rule(name='kube-cert-manager', deltas=[datetime.timedelta(0, 3600), datetime.timedelta(1), datetime.timedelta(7)], gce_disk='kube-cert-manager', gce_disk_zone='australia-southeast1-c', source=None) severity=INFO volume_name=kube-cert-manager zone=australia-southeast1-c
2017-08-21T00:12:42.237344Z foreign                        [asyncio] message=Using selector: EpollSelector severity=DEBUG
2017-08-21T00:12:42.238683Z Gathering tasks                [k8s_snapshots.core] message=Gathering tasks severity=DEBUG tasks=[<Task pending coro=<scheduler() running at /usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py:271>>, <Task pending coro=<backuper() running at /usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/core.py:289>>]
2017-08-21T00:12:42.239732Z scheduler.start                [k8s_snapshots.core] message=scheduler.start severity=DEBUG
2017-08-21T00:12:42.240466Z watch_schedule.start           [k8s_snapshots.core] message=watch_schedule.start severity=DEBUG
2017-08-21T00:12:42.241388Z backuper.start                 [k8s_snapshots.core] message=backuper.start severity=DEBUG
2017-08-21T00:12:42.242186Z rule.heartbeat                 [k8s_snapshots.core] message=rule.heartbeat rules=None severity=INFO
2017-08-21T00:12:42.243085Z volume-events.watch            [k8s_snapshots.core] message=volume-events.watch severity=DEBUG
2017-08-21T00:12:42.245651Z foreign                        [googleapiclient.discovery] message=URL being requested: GET https://www.googleapis.com/discovery/v1/apis/compute/v1/rest severity=INFO
2017-08-21T00:12:42.905296Z foreign                        [googleapiclient.discovery] message=URL being requested: GET https://www.googleapis.com/compute/v1/projects/manse-cloud/global/snapshots?filter=labels.created-by+eq+k8s-snapshots&alt=json severity=INFO
2017-08-21T00:12:42.906601Z foreign                        [oauth2client.transport] message=Attempting refresh to obtain initial access_token severity=INFO
2017-08-21T00:12:42.943009Z watch-resources.worker.start   [k8s_snapshots.kube] message=watch-resources.worker.start resource_type_name=PersistentVolume severity=DEBUG
2017-08-21T00:12:42.944380Z kube-config.from-service-account [k8s_snapshots.context] message=kube-config.from-service-account severity=INFO
2017-08-21T00:12:42.950273Z foreign                        [oauth2client.crypt] message=[<removed because I don't know if it's sensitive>] severity=DEBUG
2017-08-21T00:12:42.951193Z foreign                        [oauth2client.client] message=Refreshing access_token severity=INFO
2017-08-21T00:12:42.952846Z foreign                        [urllib3.connectionpool] message=Starting new HTTPS connection (1): 10.55.240.1 severity=DEBUG
2017-08-21T00:12:42.963228Z foreign                        [urllib3.connectionpool] message=https://10.55.240.1:443 "GET /api/v1/persistentvolumes?watch=true HTTP/1.1" 200 None severity=DEBUG
2017-08-21T00:12:44.012902Z watch_schedule.wait-for-both   [k8s_snapshots.core] message=watch_schedule.wait-for-both severity=DEBUG
2017-08-21T00:12:44.907999Z watch-resources.worker.start   [k8s_snapshots.kube] message=watch-resources.worker.start resource_type_name=PersistentVolumeClaim severity=DEBUG
2017-08-21T00:12:44.910447Z foreign                        [urllib3.connectionpool] message=Starting new HTTPS connection (1): 10.55.240.1 severity=DEBUG
2017-08-21T00:12:44.921444Z foreign                        [urllib3.connectionpool] message=https://10.55.240.1:443 "GET /api/v1/namespaces/default/persistentvolumeclaims?watch=true HTTP/1.1" 200 None severity=DEBUG
2017-08-21T00:22:42.244870Z rule.heartbeat                 [k8s_snapshots.core] message=rule.heartbeat rules=None severity=INFO
2017-08-21T00:32:42.246586Z rule.heartbeat                 [k8s_snapshots.core] message=rule.heartbeat rules=None severity=INFO                                                                                                                                                   
2017-08-21T00:42:42.247750Z rule.heartbeat                 [k8s_snapshots.core] message=rule.heartbeat rules=None severity=INFO                                                                                                                                                   
2017-08-21T00:47:36.623109Z watch-resources.worker.finalized [k8s_snapshots.kube] message=watch-resources.worker.finalized resource_type_name=PersistentVolume severity=DEBUG                                                                                                     
2017-08-21T00:47:36.715707Z watch-resources.done           [k8s_snapshots.kube] message=watch-resources.done resource_type_name=PersistentVolume severity=DEBUG                                                                                                                   
2017-08-21T00:52:42.249679Z rule.heartbeat                 [k8s_snapshots.core] message=rule.heartbeat rules=None severity=INFO                                                                                                                                                   
2017-08-21T01:01:47.579088Z watch-resources.worker.finalized [k8s_snapshots.kube] message=watch-resources.worker.finalized resource_type_name=PersistentVolumeClaim severity=DEBUG                                                                                                
2017-08-21T01:01:47.602920Z watch-resources.done           [k8s_snapshots.kube] message=watch-resources.done resource_type_name=PersistentVolumeClaim severity=DEBUG                                                                                                              
2017-08-21T01:01:47.603821Z sync-get-rules.done            [k8s_snapshots.core] message=sync-get-rules.done severity=DEBUG                                                                                                                                                        
2017-08-21T01:01:47.604225Z get-rules.done                 [k8s_snapshots.core] message=get-rules.done severity=DEBUG                                                                                                                                                             
2017-08-21T01:02:42.251460Z rule.heartbeat                 [k8s_snapshots.core] message=rule.heartbeat rules=None severity=INFO                                                                                                                                                   
2017-08-21T01:12:42.253261Z rule.heartbeat                 [k8s_snapshots.core] message=rule.heartbeat rules=None severity=INFO                                                                                                                                                   
2017-08-21T01:22:42.254266Z rule.heartbeat                 [k8s_snapshots.core] message=rule.heartbeat rules=None severity=INFO                                                                                                                                                   
2017-08-21T01:32:42.255648Z rule.heartbeat                 [k8s_snapshots.core] message=rule.heartbeat rules=None severity=INFO 

Any idea why no snapshots are being created? I've checked that the region is right, that the name of the disk is right, and the project is right, so I'm reasonably confident I haven't made a mistake there. I've gone over the documentation a couple of times to see if I missed anything obvious.

Also, a further question: older versions created snapshots immediately. Should this one also create a snapshot immediately, or will it wait an hour (given the above deltas) before creating the first?

Thanks!

Exception in callback

I get the following error in one of my clusters:

Exception in callback combine.<locals>.cb(<Task cancell...cutils.py:16>>) at /app/asyncutils.py:24
handle: <Handle combine.<locals>.cb(<Task cancell...cutils.py:16>>) at /app/asyncutils.py:24>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/asyncio/events.py", line 126, in _run
    self._callback(*self._args)
  File "/app/asyncutils.py", line 25, in cb
    if task.exception():
concurrent.futures._base.CancelledError

I'm trying to find out if one of the PVs has some unexpected data, but there are a lot of them, so that might be hard. Any pointers that could help tackle the issue?

Prior to this error the following debug messages were logged:

[2017-01-27 14:12:17.913451] DEBUG: daemon: Event in persistent volume stream: WatchEvent(type='ADDED', object=<PersistentVolume pvc-d2451259-e416-11e6-a7ce-42010af00060>)
[2017-01-27 14:12:17.913579] DEBUG: daemon: Volume pvc-d2451259-e416-11e6-a7ce-42010af00060 does not define backup deltas (via backup.kubernetes.io/deltas)
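
For what it's worth, asyncio's Task.exception() raises CancelledError when the task was cancelled, so a done-callback generally has to check for cancellation first. A minimal sketch (everything past the guard is illustrative):

def cb(task):
    # Task.exception() raises CancelledError on a cancelled task,
    # so bail out before asking for the exception.
    if task.cancelled():
        return
    exc = task.exception()
    if exc is not None:
        raise exc  # or log / re-schedule, depending on the task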

Service stops watching for new k8s events if k8s api server closes connection

Kubernetes.watch returns a pykube.query.WatchQuery.object_stream (https://github.com/kelproject/pykube/blob/master/pykube/query.py#L147). The behaviour of this iterator is to stream events indefinitely from the http request stream, until it ends. In theory it should not end, but ours is not a perfect world, and connections get interrupted, servers get moved or restarted, etc.

It looks like k8s-snapshots currently assumes this stream will never end, and when it does end, it is not re-opened, yet the process stays running. get_rules finishes, but nothing else happens, and k8s-snapshots is now blind and not doing its job.

I think the correct fix would be to add a retry-forever loop to watch_resources (or rather, to _watch_resources_thread_wrapper) such that if the connection is closed (without error) it simply re-opens it.

Another related fix would be for daemon to die if either scheduler or backuper returns, rather than waiting for both, since one running without the other is always an error condition.
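
A minimal sketch of the retry-forever idea, reusing the same pykube-based watch call the code already makes (client_factory and handle_event are placeholders):

import time

def watch_forever(client_factory, resource_type, handle_event):
    # Re-open the watch whenever the underlying HTTP stream ends, so a
    # closed connection does not leave the daemon blind.
    while True:
        try:
            api = client_factory()
            for event in resource_type.objects(api).watch().object_stream():
                handle_event(event)
        except Exception:
            # Connection errors and the like; real code should log these.
            pass
        # The stream ended (cleanly or not): back off briefly, then reconnect.
        time.sleep(1)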

This is likely the cause of #27,
and I'm seeing it myself also:

{
  "message": "watch-resources.worker.finalized",
  "event": "watch-resources.worker.finalized",
  "resource_type_name": "PersistentVolumeClaim",
  "logger": "k8s_snapshots.kube",
  "severity": "DEBUG",
  "timestamp": "2018-08-23T22:38:00.095136Z"
}
{
  "message": "watch-resources.done",
  "event": "watch-resources.done",
  "resource_type_name": "PersistentVolumeClaim",
  "logger": "k8s_snapshots.kube",
  "severity": "DEBUG",
  "timestamp": "2018-08-23T22:38:00.173122Z"
}
{
  "message": "sync-get-rules.done",
  "event": "sync-get-rules.done",
  "logger": "k8s_snapshots.core",
  "severity": "DEBUG",
  "timestamp": "2018-08-23T22:38:00.173860Z"
}
{
  "message": "get-rules.done",
  "event": "get-rules.done",
  "logger": "k8s_snapshots.core",
  "severity": "DEBUG",
  "timestamp": "2018-08-23T22:38:00.174262Z"
}

Find a better way to identify the disk zone

Currently, we read the zone from failure-domain.beta.kubernetes.io/zone. This has always seemed problematic; indeed, I ran across a disk in my cluster that lacks it. Possibly something changed in one of the various Kubernetes versions I was using. Some people have mentioned that things are different when the disk is created by hand rather than dynamically provisioned.

There are other options:

PV's that are not auto-provisioned are not backed up

In the code I see you check whether a volume is actually a GCE PD.

If it's auto-provisioned, the pv.kubernetes.io/provisioned-by annotation is present, like this:

"annotations": {
  "kubernetes.io/createdby": "gce-pd-dynamic-provisioner",
  "pv.kubernetes.io/bound-by-controller": "yes",
  "pv.kubernetes.io/provisioned-by": "kubernetes.io/gce-pd",
  "volume.beta.kubernetes.io/storage-class": "fast"
}

However, if you create the PV yourself with a manifest, it's not; the annotations would look something like this:

"annotations": {
  "backup.kubernetes.io/deltas": "1h 8h 2d 7d 30d",
  "kubectl.kubernetes.io/last-applied-configuration": "..."
}

As a workaround I could obviously add the annotation myself, but isn't there another way to detect whether it's a GCE PD? Besides, it doesn't feel particularly safe to add one of those GCE PV controller annotations to my 'manually' created PV.
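
As a hedged sketch of one alternative: look at the volume source in the PV spec rather than at provisioner annotations, since a hand-written GCE PD volume still declares spec.gcePersistentDisk (pv is assumed to be a pykube PersistentVolume object):

def is_gce_pd(pv) -> bool:
    # Both manually created and dynamically provisioned GCE PD volumes
    # declare their disk under spec.gcePersistentDisk, regardless of
    # which annotations the provisioner did or did not add.
    source = pv.obj.get('spec', {}).get('gcePersistentDisk')
    return bool(source) and 'pdName' in source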

Restore Snapshots

Are there any docs on how to restore snapshots taken using this app? My cluster is in AWS.
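
As far as I can tell, the daemon itself only creates and expires snapshots; restoring is done with the cloud provider's own tooling. For EBS, one manual approach (the snapshot ID and zone below are placeholders) is to create a fresh volume from the snapshot and then reference it from a PersistentVolume via spec.awsElasticBlockStore.volumeID:

aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone eu-west-1a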

Support backup definitions via a ThirdPartyResource

Because it was difficult to maintain while supporting multiple storage backends, the ability to manually define volumes to be backed up will be removed.

I envision that we instead have a ThirdPartyResource that allows backup definitions. This will achieve a similar thing, but:

  • Does not require the daemon to be restarted to change.
  • Feels more Kubernetes-native.
  • Is easier to configure than through a large number of environment variables.
