Velero full backup failing because of ArgoCD pruning VolumeSnapshot resources #273

Closed
ricsanfre opened this issue Jan 21, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@ricsanfre
Owner

Issue description

Velero full backup fails with the phase 'PartiallyFailed'.

  • velero backup describe command output:
Name:         full21012024
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.28.2+k3s1
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=28

Phase:  PartiallyFailed (run `velero backup logs full21012024` for more information)


Warnings:
  Velero:   
            
  Cluster:    <none>
  Namespaces: <none>

Errors:
  Velero:    name: /my-cluster-kafka-0 error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=kafka, name=velero-data-0-my-cluster-kafka-0-bw4g7): rpc error: code = Unknown desc = failed to get volumesnapshot kafka/velero-data-0-my-cluster-kafka-0-bw4g7: volumesnapshots.snapshot.storage.k8s.io "velero-data-0-my-cluster-kafka-0-bw4g7" not found
             name: /my-cluster-kafka-1 error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=kafka, name=velero-data-0-my-cluster-kafka-1-p4qkk): rpc error: code = Unknown desc = failed to get volumesnapshot kafka/velero-data-0-my-cluster-kafka-1-p4qkk: volumesnapshots.snapshot.storage.k8s.io "velero-data-0-my-cluster-kafka-1-p4qkk" not found
             name: /my-cluster-kafka-2 error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=kafka, name=velero-data-0-my-cluster-kafka-2-t4bgr): rpc error: code = Unknown desc = failed to get volumesnapshot kafka/velero-data-0-my-cluster-kafka-2-t4bgr: volumesnapshots.snapshot.storage.k8s.io "velero-data-0-my-cluster-kafka-2-t4bgr" not found
             name: /my-cluster-zookeeper-0 error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=kafka, name=velero-data-my-cluster-zookeeper-0-nl5wr): rpc error: code = Unknown desc = failed to get volumesnapshot kafka/velero-data-my-cluster-zookeeper-0-nl5wr: volumesnapshots.snapshot.storage.k8s.io "velero-data-my-cluster-zookeeper-0-nl5wr" not found
             name: /my-cluster-zookeeper-1 error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=kafka, name=velero-data-my-cluster-zookeeper-1-b22zq): rpc error: code = Unknown desc = failed to get volumesnapshot kafka/velero-data-my-cluster-zookeeper-1-b22zq: volumesnapshots.snapshot.storage.k8s.io "velero-data-my-cluster-zookeeper-1-b22zq" not found
            
            
            
  Cluster:    <none>
  Namespaces: <none>

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          false
Data Mover:                  velero

TTL:  720h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-01-21 11:13:16 +0100 CET
Completed:  2024-01-21 11:31:52 +0100 CET

Expiration:  2024-02-20 11:13:16 +0100 CET

Total items to be backed up:  20754
Items backed up:              20754

Backup Item Operations:  26 of 28 completed successfully, 2 failed (specify --details for more information)
Velero-Native Snapshots: <none included>
  • velero backup logs command output:
time="2024-01-21T10:17:28Z" level=error msg="error getting volumesnapshot kafka/velero-data-my-cluster-zookeeper-2-frf48: volumesnapshots.snapshot.storage.k8s.io \"velero-data-my-cluster-zookeeper-2-frf48\" not found" backup=velero/full21012024 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/backup/volumesnapshot_action.go:234" pluginName=velero-plugin-for-csi
time="2024-01-21T10:17:28Z" level=warning msg="VolumeSnapshot has a temporary error Failed to check and update snapshot content: failed to take snapshot of the volume pvc-dc8d66a9-882a-471c-8f3c-6a7607d054cd: \"timestamp: nil Timestamp\". Snapshot controller will retry later." backup=velero/full21012024 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/backup/volumesnapshot_action.go:250" pluginName=velero-plugin-for-csi
time="2024-01-21T10:17:28Z" level=error msg=0 backup=velero/full21012024 logSource="pkg/controller/backup_controller.go:675"
time="2024-01-21T10:17:28Z" level=error msg=1 backup=velero/full21012024 logSource="pkg/controller/backup_controller.go:675"
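
To check the missing snapshots against the cluster, it helps to pull their namespaces and names out of the describe output first. The following is a minimal sketch, assuming the error lines keep the "failed to get volumesnapshot <namespace>/<name>:" wording shown above. The function name `missing_snapshots` and the regular expression are illustrative and not part of Velero.

```python
import re
from typing import List, Tuple

# Matches the "failed to get volumesnapshot <namespace>/<name>:" fragment
# found in the Errors section of the describe output (assumed wording).
MISSING_SNAPSHOT = re.compile(
    r"failed to get volumesnapshot (?P<namespace>[^/\s]+)/(?P<name>[^:\s]+):"
)


def missing_snapshots(describe_output: str) -> List[Tuple[str, str]]:
    """Return (namespace, name) for each VolumeSnapshot reported as not found."""
    return [
        (m.group("namespace"), m.group("name"))
        for m in MISSING_SNAPSHOT.finditer(describe_output)
    ]


# Two error lines copied from the describe output above.
sample = """
  Velero:    name: /my-cluster-kafka-0 error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=kafka, name=velero-data-0-my-cluster-kafka-0-bw4g7): rpc error: code = Unknown desc = failed to get volumesnapshot kafka/velero-data-0-my-cluster-kafka-0-bw4g7: volumesnapshots.snapshot.storage.k8s.io "velero-data-0-my-cluster-kafka-0-bw4g7" not found
             name: /my-cluster-zookeeper-0 error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=kafka, name=velero-data-my-cluster-zookeeper-0-nl5wr): rpc error: code = Unknown desc = failed to get volumesnapshot kafka/velero-data-my-cluster-zookeeper-0-nl5wr: volumesnapshots.snapshot.storage.k8s.io "velero-data-my-cluster-zookeeper-0-nl5wr" not found
"""

for namespace, name in missing_snapshots(sample):
    print(f"{namespace}/{name}")
```

Each resulting pair can then be looked up in the cluster to confirm that the VolumeSnapshot really was deleted while the backup was still running.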

Initial triage

The error could be related to vmware-tanzu/velero#4330. The full backup takes close to 20 minutes to finish (about 18.5 minutes in the run above), and during that time some of the resources created by the backup process are deleted, which is why the backup fails.
As hinted in this comment, vmware-tanzu/velero#4330 (comment), the ArgoCD auto-sync policy is pruning the VolumeSnapshot and VolumeSnapshotContent resources that the backup process creates automatically.

As the comment indicates, the fix is to make ArgoCD ignore the VolumeSnapshot and VolumeSnapshotContent resources during synchronization.

@ricsanfre ricsanfre added the bug Something isn't working label Jan 21, 2024
@ricsanfre
Owner Author

Solution

Excluding VolumeSnapshot and VolumeSnapshotContent from ArgoCD synchronization fixes this error.

To do so, add this configuration to the ArgoCD Helm chart values:

configs:
  cm:
    ## Ignore resources
    # https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/#resource-exclusioninclusion
    # Ignore VolumeSnapshot and VolumeSnapshotContent: Created by backup processes.
    resource.exclusions: |
      - apiGroups:
        - snapshot.storage.k8s.io
        kinds:
        - VolumeSnapshot
        - VolumeSnapshotContent
        clusters:
        - "*"
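
To make the effect of the exclusion rule concrete, here is a simplified model of how such a rule could be evaluated. It is a sketch under stated assumptions, not ArgoCD's actual implementation: it assumes glob-style matching on API group, kind and cluster, and it requires all three lists to be present in each rule. The names `EXCLUSIONS` and `is_excluded` are hypothetical, and the cluster URL is only an example value.

```python
from fnmatch import fnmatchcase

# Mirrors the resource.exclusions entry from the Helm values above.
EXCLUSIONS = [
    {
        "apiGroups": ["snapshot.storage.k8s.io"],
        "kinds": ["VolumeSnapshot", "VolumeSnapshotContent"],
        "clusters": ["*"],
    }
]


def _matches(patterns, value):
    """True if value matches at least one glob pattern (case-sensitive)."""
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def is_excluded(api_group, kind, cluster, exclusions=EXCLUSIONS):
    """True if some rule matches the API group, the kind and the cluster."""
    return any(
        _matches(rule["apiGroups"], api_group)
        and _matches(rule["kinds"], kind)
        and _matches(rule["clusters"], cluster)
        for rule in exclusions
    )


cluster = "https://kubernetes.default.svc"  # example cluster URL (assumption)
print(is_excluded("snapshot.storage.k8s.io", "VolumeSnapshot", cluster))
print(is_excluded("snapshot.storage.k8s.io", "VolumeSnapshotClass", cluster))
print(is_excluded("apps", "Deployment", cluster))
```

Under this model only the two kinds created by the backup process are excluded. Other kinds in the same API group, such as VolumeSnapshotClass, do not match the rule.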

There is still an issue with the snapshots created by the Velero backup, which is tracked in #239.
