DataUpload Fails on Kubernetes 1.29 due to changed VSC SourceVolumeMode #8259

Open
msfrucht opened this issue Oct 3, 2024 · 10 comments

Comments

@msfrucht
Contributor

msfrucht commented Oct 3, 2024

What steps did you take and what happened:

Performed a DataUpload using Velero 1.14.1 on Kubernetes 1.29/OpenShift 4.16.

The DataUpload fails with the following error from the node-agent:

2024-10-03T09:21:23Z ERROR Reconciler error {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "DataUpload": {"name":"be9184c2-b547-46f6-a4c0-c6a20d96e7e0-1","namespace":"ibm-backup-restore"}, "namespace": "ibm-backup-restore", "name": "be9184c2-b547-46f6-a4c0-c6a20d96e7e0-1", "reconcileID": "ea068f8e-f12b-4e14-8e69-ee44e8d72e19", "error": "error to delete volume snapshot content: error to assure VolumeSnapshotContent is deleted, snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222: error to get VolumeSnapshotContent snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222: client rate limiter Wait returned an error: context deadline exceeded", "errorVerbose": "client rate limiter Wait returned an error: context deadline exceeded\nerror to get VolumeSnapshotContent snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222\ngithub.com/vmware-tanzu/velero/pkg/util/csi.EnsureDeleteVSC.func1\n\t/go/src/github.com/vmware-tanzu/velero/pkg/util/csi/volume_snapshot.go:229

The actual failure shows itself in the CSI driver logs for Ceph RBD and the snapshot-controller webhook pod.

E1003 17:01:23.202234       1 snapshot_controller.go:124] checkandUpdateContentStatus [snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222]: error occurred failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222: "snapshot controller failed to update snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222 on API server: admission webhook \"volumesnapshotclasses.snapshot.storage.k8s.io\" denied the request: Spec.SourceVolumeMode is immutable but was changed from Filesystem to nil"
E1003 17:01:23.202271       1 snapshot_controller_base.go:265] could not sync content "snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222": failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222: "snapshot controller failed to update snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222 on API server: admission webhook \"volumesnapshotclasses.snapshot.storage.k8s.io\" denied the request: Spec.SourceVolumeMode is immutable but was changed from Filesystem to nil"

During creation of the backup VSC, the field Spec.SourceVolumeMode is not copied, which results in the failure. Newer versions of the snapshot-controller validate the SourceVolumeMode field against the previous version of the object and reject the update if it has changed.
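
To make the webhook message above easier to read, here is a minimal sketch of the kind of immutability check it implies. This is not the external-snapshotter source. The function name is hypothetical, and the handling of an unset old value (allowed here) is an assumption.

```python
from typing import Optional


def check_source_volume_mode(old_spec: dict, new_spec: dict) -> Optional[str]:
    """Return a denial message if sourceVolumeMode changed, else None.

    Hypothetical helper modelled on the webhook log message only.
    """
    old_mode = old_spec.get("sourceVolumeMode")
    new_mode = new_spec.get("sourceVolumeMode")
    # Assumption: only an already-set value is treated as immutable.
    if old_mode is not None and new_mode != old_mode:
        shown = "nil" if new_mode is None else new_mode
        return (
            "Spec.SourceVolumeMode is immutable but was changed "
            f"from {old_mode} to {shown}"
        )
    return None


# Old object carries the field, new object has lost it: the update is denied.
denied = check_source_volume_mode({"sourceVolumeMode": "Filesystem"}, {})
# Field unchanged: the update is allowed.
allowed = check_source_volume_mode(
    {"sourceVolumeMode": "Filesystem"}, {"sourceVolumeMode": "Filesystem"}
)
print(denied)
print(allowed)
```

With the field dropped from the new object, the sketch produces the same "from Filesystem to nil" text that appears in the snapshot-controller log.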

What did you expect to happen:
DataUpload to succeed.

If you are using velero v1.7.0+:
The Velero Backup object has been deleted. I only have access to the velero and node-agent logs. The node-agent logs were the only ones of value for this issue.

node-agent-logs.zip

Anything else you would like to add:

Webhook logs.
webhook-logs.zip

Environment:

  • Velero version (use velero version): Velero 1.14
  • Velero features (use velero client config get features): EnableCSI
  • Kubernetes version (use kubectl version): 1.29
  • Kubernetes installer & version: unknown
  • Cloud provider or hardware configuration: Red Hat OpenShift 4.16
  • OS (e.g. from /etc/os-release): Red Hat Core

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@shubham-pampattiwar
Collaborator

@Lyndon-Li
Contributor

Lyndon-Li commented Oct 6, 2024

Good catch!

Created another issue, #8267, to collect the error messages from the backup/source VS/VSC to help with troubleshooting.

@blackpiglet
Contributor

blackpiglet commented Oct 8, 2024

@msfrucht @shubham-pampattiwar
I have a slightly different opinion about the fix in PR #8261.
It may not be enough to fix the issue in your scenario.

The fix adds the source volume mode to the PVC VSC, but the reported error is for the VolumeSnapshotContent created in the source namespace.
So the error should be for snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222. Its corresponding VolumeSnapshot should be in the namespace ibm-backup-restore.
The backup VSC's corresponding VS is created in the namespace where the Velero server pod runs.

2024-10-03T09:21:23Z	ERROR	Reconciler error	{"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "DataUpload": {"name":"be9184c2-b547-46f6-a4c0-c6a20d96e7e0-1","namespace":"ibm-backup-restore"}, "namespace": "ibm-backup-restore", "name": "be9184c2-b547-46f6-a4c0-c6a20d96e7e0-1", "reconcileID": "ea068f8e-f12b-4e14-8e69-ee44e8d72e19", "error": "error to delete volume snapshot content: error to assure VolumeSnapshotContent is deleted, snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222: error to get VolumeSnapshotContent snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222: client rate limiter Wait returned an error: context deadline exceeded", "errorVerbose": "client rate limiter Wait returned an error: context deadline exceeded\nerror to get VolumeSnapshotContent snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222\ngithub.com/vmware-tanzu/velero/pkg/util/csi.EnsureDeleteVSC.func1\n\t/go/src/github.com/vmware-tanzu/velero/pkg/util/csi/volume_snapshot.go:229

@msfrucht
Contributor Author

msfrucht commented Oct 8, 2024

VolumeSnapshotContent objects are not namespaced. The logging is to indicate the VolumeSnapshot's namespace.

This is similar to the relationship of PVC and PV.

During backup, the VolumeSnapshot is moved to the Velero install namespace. The VolumeSnapshotContent is simply copied under a new name, with no namespace.

During backup, the VolumeSnapshotContent is copied, and that copy fails to pick up the SourceVolumeMode.
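
The copy step described here can be sketched roughly as follows. This is an illustration only, not Velero's code. The helper name, the list of copied fields, and the new object name are assumptions. The point is that sourceVolumeMode has to be in the set of fields carried over.

```python
import copy

# Assumption: the spec fields a copy would carry over. The issue is that
# sourceVolumeMode was missing from the equivalent list.
COPIED_SPEC_FIELDS = ("driver", "volumeSnapshotClassName", "sourceVolumeMode")


def build_backup_vsc(source_vsc: dict, new_name: str) -> dict:
    """Build a renamed copy of a VolumeSnapshotContent (hypothetical helper)."""
    source_spec = source_vsc["spec"]
    spec = {
        field: copy.deepcopy(source_spec[field])
        for field in COPIED_SPEC_FIELDS
        if field in source_spec
    }
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshotContent",
        # VolumeSnapshotContent is cluster-scoped, so there is no namespace.
        "metadata": {"name": new_name},
        "spec": spec,
    }


source = {
    "metadata": {"name": "snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222"},
    "spec": {
        "driver": "rook-ceph.rbd.csi.ceph.com",
        "sourceVolumeMode": "Filesystem",
        "volumeSnapshotClassName": "csi-rbdplugin-snapclass",
    },
}
# "backup-vsc-example" is a made-up name for illustration.
backup = build_backup_vsc(source, "backup-vsc-example")
print(backup["spec"].get("sourceVolumeMode"))
```

Removing "sourceVolumeMode" from COPIED_SPEC_FIELDS reproduces the situation described: the copy is created without the field.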

@reasonerjt
Contributor

This should not block v1.15 rc.
Let's find the root cause, because this looks like an issue in the CSI snapshotter controller and the webhooks.

@msfrucht
Contributor Author

msfrucht commented Oct 9, 2024

I will need to check which part of the puzzle is responsible for setting SourceVolumeMode: the external-snapshotter or the CSI driver.

Even so, the missing SourceVolumeMode would still have caused a failure when copying the VolumeSnapshotContent if the field had been set.

@msfrucht
Contributor Author

msfrucht commented Oct 9, 2024

It is the responsibility of the external-snapshotter. https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3141-prevent-volume-mode-conversion

With this change, the controller will fetch the Spec.PersistentVolumeMode of the PV and add that to newly introduced Spec.SourceVolumeMode field of the VolumeSnapshotContent to be created.

That appears to be working correctly. The AdmissionReview shows that the ServiceAccount requesting the change is ceph-csi's system:serviceaccount:rook-ceph:rook-csi-rbd-provisioner-sa, which provides the CSI driver as part of a Rook.io install.

I will need to open an issue with ceph-csi.
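
For reference, the KEP behaviour quoted in this comment can be sketched like this. It is a simplified illustration, not the external-snapshotter implementation. The function name is hypothetical, and the PV is assumed to be a plain dict whose serialized field is spec.volumeMode (the KEP text refers to it as Spec.PersistentVolumeMode).

```python
def vsc_spec_from_pv(pv: dict, driver: str, volume_handle: str) -> dict:
    """Build a VolumeSnapshotContent spec for a PV (hypothetical helper)."""
    spec = {
        "driver": driver,
        "source": {"volumeHandle": volume_handle},
    }
    # Copy the PV's volume mode into sourceVolumeMode when it is set.
    mode = pv.get("spec", {}).get("volumeMode")
    if mode is not None:
        spec["sourceVolumeMode"] = mode
    return spec


pv = {"spec": {"volumeMode": "Filesystem"}}
vsc_spec = vsc_spec_from_pv(
    pv,
    driver="rook-ceph.rbd.csi.ceph.com",
    volume_handle="0001-0009-rook-ceph-0000000000000002-9b2a2bdf-8158-11ef-9bf8-0a580afe141d",
)
print(vsc_spec["sourceVolumeMode"])
```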

@msfrucht
Contributor Author

msfrucht commented Oct 9, 2024

Ceph-csi has already updated to external-snapshotter v8.0.0 in development: ceph/ceph-csi@c48f5bf

There are no ceph-csi releases with external-snapshotter v8.0.0 yet.

@Lyndon-Li
Contributor

@msfrucht
Do you see this problem in a standard OpenShift env?
Is this problem 100% reproducible in OpenShift?

@Lyndon-Li
Contributor

Lyndon-Li commented Oct 12, 2024

Below are the object and oldObject reported by the webhook when the request was denied.
The external-snapshotter is trying to remove the snapshot.storage.kubernetes.io/volumesnapshot-being-created annotation.
However, this operation is a Patch operation.
I also searched both the Velero code (1.14) and the external-snapshotter code: nothing calls Update against the VSC, and nothing tries to set sourceVolumeMode after the VSC is created.

For a Patch operation, however, the new object passed to the webhook is created by the API server.
So this problem may be caused by a version mismatch between the API server and the external-snapshotter. That is:

  • The ObjectCreater of the API server creates a new VSC object without the sourceVolumeMode field and then applies the patch
  • This new object is then passed to the webhook of the external-snapshotter
  • From the external-snapshotter's side, the sourceVolumeMode field exists in the VSC CRD, so the webhook treats the missing value as nil
  • The validation then fails because the old object's sourceVolumeMode is Filesystem
    "object":{
      "apiVersion":"snapshot.storage.k8s.io/v1",
      "kind":"VolumeSnapshotContent",
      "metadata":{
        "annotations":{
          "snapshot.storage.kubernetes.io/deletion-secret-name":"rook-csi-rbd-provisioner",
          "snapshot.storage.kubernetes.io/deletion-secret-namespace":"rook-ceph",
          "snapshot.storage.kubernetes.io/volumesnapshot-being-deleted":"yes"
        },
        "creationTimestamp":"2024-10-03T09:11:12Z",
        "deletionGracePeriodSeconds":0,
        "deletionTimestamp":"2024-10-03T09:11:23Z",
        "finalizers":["snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection"],
        "generation":4,
        "managedFields":[
          {
            "apiVersion":"snapshot.storage.k8s.io/v1",
            "fieldsType":"FieldsV1",
            "fieldsV1":{
              "f:spec":{
                "f:deletionPolicy":{}
              }
            },
            "manager":"node-agent-server",
            "operation":"Update",
            "time":"2024-10-03T09:11:21Z"
          },
          {
            "apiVersion":"snapshot.storage.k8s.io/v1",
            "fieldsType":"FieldsV1",
            "fieldsV1":{
              "f:metadata":{
                "f:annotations":{
                  ".":{},
                  "f:snapshot.storage.kubernetes.io/deletion-secret-name":{},
                  "f:snapshot.storage.kubernetes.io/deletion-secret-namespace":{},
                  "f:snapshot.storage.kubernetes.io/volumesnapshot-being-deleted":{}
                },
                "f:finalizers":{
                  ".":{},
                  "v:\"snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection\"":{}
                }
              },
              "f:spec":{
                ".":{},
                "f:driver":{},
                "f:source":{
                  ".":{},
                  "f:volumeHandle":{}
                },
                "f:volumeSnapshotClassName":{},
                "f:volumeSnapshotRef":{}
              }
            },
            "manager":"snapshot-controller",
            "operation":"Update",
            "time":"2024-10-03T09:11:21Z"
          },
          {
            "apiVersion":"snapshot.storage.k8s.io/v1",
            "fieldsType":"FieldsV1",
            "fieldsV1":{
              "f:status":{
                ".":{},
                "f:creationTime":{},
                "f:readyToUse":{},
                "f:restoreSize":{},
                "f:snapshotHandle":{}
              }
            },
            "manager":"csi-snapshotter",
            "operation":"Update",
            "subresource":"status",
            "time":"2024-10-03T16:31:48Z"
          }
        ],
        "name":"snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222",
        "resourceVersion":"658477",
        "uid":"b8610979-2005-4912-8e2b-4acfaa980615"
      },
      "spec":{
        "deletionPolicy":"Retain",
        "driver":"rook-ceph.rbd.csi.ceph.com",
        "source":{
          "volumeHandle":"0001-0009-rook-ceph-0000000000000002-9b2a2bdf-8158-11ef-9bf8-0a580afe141d"
        },
        "volumeSnapshotClassName":"csi-rbdplugin-snapclass",
        "volumeSnapshotRef":{
          "apiVersion":"snapshot.storage.k8s.io/v1",
          "kind":"VolumeSnapshot",
          "name":"2819d226-2a1b-4cfa-a79e-9359a6ba8730-1727946672.3988628",
          "namespace":"filebrowser1",
          "resourceVersion":"224951",
          "uid":"f3d87ab0-5db8-49d9-bd91-f86c089be222"
        }
      },
      "status":{
        "creationTime":1727946674746281252,
        "readyToUse":true,
        "restoreSize":10737418240,
        "snapshotHandle":"0001-0009-rook-ceph-0000000000000002-70a08f90-8167-11ef-bf91-0a580afe0c17"
      }
    },
    "oldObject":{
      "apiVersion":"snapshot.storage.k8s.io/v1",
      "kind":"VolumeSnapshotContent",
      "metadata":{
        "annotations":{
          "snapshot.storage.kubernetes.io/deletion-secret-name":"rook-csi-rbd-provisioner",
          "snapshot.storage.kubernetes.io/deletion-secret-namespace":"rook-ceph",
          "snapshot.storage.kubernetes.io/volumesnapshot-being-created":"yes",
          "snapshot.storage.kubernetes.io/volumesnapshot-being-deleted":"yes"
        },
        "creationTimestamp":"2024-10-03T09:11:12Z",
        "deletionGracePeriodSeconds":0,
        "deletionTimestamp":"2024-10-03T09:11:23Z",
        "finalizers":["snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection"],
        "generation":3,
        "managedFields":[
          {
            "apiVersion":"snapshot.storage.k8s.io/v1",
            "fieldsType":"FieldsV1",
            "fieldsV1":{
              "f:metadata":{
                "f:annotations":{"f:snapshot.storage.kubernetes.io/volumesnapshot-being-created":{}}
              }
            },
            "manager":"csi-snapshotter",
            "operation":"Update",
            "time":"2024-10-03T09:11:14Z"
          },
          {
            "apiVersion":"snapshot.storage.k8s.io/v1",
            "fieldsType":"FieldsV1",
            "fieldsV1":{
              "f:spec":{
                "f:deletionPolicy":{}
              }
            },
            "manager":"node-agent-server",
            "operation":"Update",
            "time":"2024-10-03T09:11:21Z"
          },
          {
            "apiVersion":"snapshot.storage.k8s.io/v1",
            "fieldsType":"FieldsV1",
            "fieldsV1":{
              "f:metadata":{
                "f:annotations":{
                  ".":{},
                  "f:snapshot.storage.kubernetes.io/deletion-secret-name":{},
                  "f:snapshot.storage.kubernetes.io/deletion-secret-namespace":{},
                  "f:snapshot.storage.kubernetes.io/volumesnapshot-being-deleted":{}
                },
                "f:finalizers":{
                  ".":{},
                  "v:\"snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection\"":{}
                }
              },
              "f:spec":{
                ".":{},
                "f:driver":{},
                "f:source":{
                  ".":{},
                  "f:volumeHandle":{}
                },
                "f:sourceVolumeMode":{},
                "f:volumeSnapshotClassName":{},
                "f:volumeSnapshotRef":{}
              }
            },
            "manager":"snapshot-controller",
            "operation":"Update",
            "time":"2024-10-03T09:11:21Z"
          },
          {
            "apiVersion":"snapshot.storage.k8s.io/v1",
            "fieldsType":"FieldsV1",
            "fieldsV1":{
              "f:status":{
                ".":{},
                "f:creationTime":{},
                "f:readyToUse":{},
                "f:restoreSize":{},
                "f:snapshotHandle":{}
              }
            },
            "manager":"csi-snapshotter",
            "operation":"Update",
            "subresource":"status",
            "time":"2024-10-03T16:31:48Z"
          }
        ],
        "name":"snapcontent-f3d87ab0-5db8-49d9-bd91-f86c089be222",
        "resourceVersion":"658477",
        "uid":"b8610979-2005-4912-8e2b-4acfaa980615"
      },
      "spec":{
        "deletionPolicy":"Retain",
        "driver":"rook-ceph.rbd.csi.ceph.com",
        "source":{
          "volumeHandle":"0001-0009-rook-ceph-0000000000000002-9b2a2bdf-8158-11ef-9bf8-0a580afe141d"
        },
        "sourceVolumeMode":"Filesystem",
        "volumeSnapshotClassName":"csi-rbdplugin-snapclass",
        "volumeSnapshotRef":{
          "apiVersion":"snapshot.storage.k8s.io/v1",
          "kind":"VolumeSnapshot",
          "name":"2819d226-2a1b-4cfa-a79e-9359a6ba8730-1727946672.3988628",
          "namespace":"filebrowser1",
          "resourceVersion":"224951",
          "uid":"f3d87ab0-5db8-49d9-bd91-f86c089be222"
        }
      },
      "status":{
        "creationTime":1727946674746281252,
        "readyToUse":true,
        "restoreSize":10737418240,
        "snapshotHandle":"0001-0009-rook-ceph-0000000000000002-70a08f90-8167-11ef-bf91-0a580afe0c17"
      }
    },
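
A quick way to see what differs between the two objects above is to list the keys that are present in oldObject but missing from object. The following is a minimal sketch using only a trimmed subset of the fields shown above. The helper name removed_paths is hypothetical.

```python
def removed_paths(old: dict, new: dict, prefix: tuple = ()) -> list:
    """Return key paths (as tuples) present in old but missing from new."""
    missing = []
    for key, old_value in old.items():
        path = prefix + (key,)
        if key not in new:
            missing.append(path)
        elif isinstance(old_value, dict) and isinstance(new[key], dict):
            # Recurse into nested objects such as metadata and spec.
            missing.extend(removed_paths(old_value, new[key], path))
    return missing


BEING_CREATED = "snapshot.storage.kubernetes.io/volumesnapshot-being-created"
BEING_DELETED = "snapshot.storage.kubernetes.io/volumesnapshot-being-deleted"

# Trimmed copies of the webhook payload above.
old_object = {
    "metadata": {"annotations": {BEING_CREATED: "yes", BEING_DELETED: "yes"}},
    "spec": {
        "deletionPolicy": "Retain",
        "driver": "rook-ceph.rbd.csi.ceph.com",
        "sourceVolumeMode": "Filesystem",
    },
}
new_object = {
    "metadata": {"annotations": {BEING_DELETED: "yes"}},
    "spec": {
        "deletionPolicy": "Retain",
        "driver": "rook-ceph.rbd.csi.ceph.com",
    },
}

removed = removed_paths(old_object, new_object)
for path in removed:
    print(" -> ".join(path))
```

On this subset, two paths are reported: the volumesnapshot-being-created annotation, which the patch intends to remove, and spec.sourceVolumeMode, which it does not.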
