Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix syncing the aggregated snapshot size between CNSVolumeInfo and CNS query result #3073

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

deepakkinni
Copy link
Contributor

@deepakkinni deepakkinni commented Sep 30, 2024

What this PR does / why we need it:
Fix syncing the aggregated snapshot size between CNSVolumeInfo and CNS query result

General Changes:

  1. Reduces logging. I observed a perf bug with a lot of log spew. Changed it to debug.
  2. Improved logging for quota. Changed the display to readable values instead of bytes.

Bug Fixes:

  1. Fixed bugs that synced CNSVolumeInfo with CNS query result:
  2. validateAndCorrectVolumeInfoSnapshotDetails should not error out, this will prevent other parts of full sync from running
  3. If there are failures we proceed to syncing other volumes rather than erroring out
  4. CNS always returned value in Mb, while CNSVolumeInfo always held in bytes, there was always a mismatch and there would always be patching, this was incorrect.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Testing done:

Changed logging for easier reading

{"level":"info","time":"2024-09-30T20:09:17.190657406Z","caller":"syncer/metadatasyncer.go:3518","msg":"cnsVolumeInfoCRUpdated: aggregated snapshot size increased by 108Mi, increased Used field for storagepolicyusage CR: vsan-default-storage-policy-snapshot-usage","TraceId":"9c668ef6-c4f5-4db6-8229-ab582a3b62f3"}
{"level":"info","time":"2024-09-30T20:09:17.198981664Z","caller":"syncer/metadatasyncer.go:3518","msg":"cnsVolumeInfoCRUpdated: aggregated snapshot size increased by 108Mi, increased Used field for storagepolicyusage CR: vsan-default-storage-policy-snapshot-usage","TraceId":"751d9b7f-69bd-4288-9b37-3d601b93de8d"}
{"level":"info","time":"2024-09-30T20:09:17.206634972Z","caller":"syncer/metadatasyncer.go:3518","msg":"cnsVolumeInfoCRUpdated: aggregated snapshot size increased by 108Mi, increased Used field for storagepolicyusage CR: vsan-default-storage-policy-snapshot-usage","TraceId":"9f4df210-aabe-443a-9461-3fce41546aed"}

Deployed changes on live setup and ensured that CNSVolumeInfo was being fixed appropriately.

{"level":"info","time":"2024-09-30T20:09:17.072054174Z","caller":"cnsvolumeinfo/cnsvolumeinfoservice.go:304","msg":"attempt: 1, Successfully patched the cnsvolumeinfo \"57268eaa-adff-4c9b-a466-62b43b9d21da\" namespace \"vmware-system-csi\"","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.072079011Z","caller":"syncer/fullsync.go:841","msg":"Updated CNSvolumeInfo with Snapshot details successfully for volume 57268eaa-adff-4c9b-a466-62b43b9d21da","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.072179479Z","caller":"syncer/fullsync.go:799","msg":"Aggregated Snapshot size mismatch for volume 96210bac-bc06-4229-9a35-6783e4fce39a, 0 in CnsVolumeInfo and 108MB in CNS","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.072189017Z","caller":"syncer/fullsync.go:811","msg":"Update aggregatedSnapshotCapacity for volume 96210bac-bc06-4229-9a35-6783e4fce39a to 108MB","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.072197192Z","caller":"syncer/fullsync.go:820","msg":"snapshot operation completion time unavailable for volumeID 96210bac-bc06-4229-9a35-6783e4fce39a, will use current time 2024-09-30 20:09:17.072194117 +0000 UTC m=+1.245091382 instead","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.072204676Z","caller":"common/util.go:459","msg":"retrieved aggregated snapshot capacity 108 for volume \"96210bac-bc06-4229-9a35-6783e4fce39a\"","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.077780548Z","caller":"cnsvolumeinfo/cnsvolumeinfoservice.go:304","msg":"attempt: 1, Successfully patched the cnsvolumeinfo \"96210bac-bc06-4229-9a35-6783e4fce39a\" namespace \"vmware-system-csi\"","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.077791137Z","caller":"syncer/metadatasyncer.go:3518","msg":"cnsVolumeInfoCRUpdated: aggregated snapshot size increased by 144Mi, increased Used field for storagepolicyusage CR: vsan-default-storage-policy-snapshot-usage","TraceId":"eccd165f-ff82-4559-b603-b7ccc5cb1891"}
{"level":"info","time":"2024-09-30T20:09:17.077801948Z","caller":"syncer/fullsync.go:841","msg":"Updated CNSvolumeInfo with Snapshot details successfully for volume 96210bac-bc06-4229-9a35-6783e4fce39a","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}

Verified that volumes that were not in sync was determined correctly

{"level":"info","time":"2024-09-30T20:09:17.077810253Z","caller":"syncer/fullsync.go:847","msg":"Number of volumes with synced aggregated snapshot size with CNS 957","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.077818228Z","caller":"syncer/fullsync.go:848","msg":"Number of volumes with out-of-sync aggregated snapshot size with CNS 67","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}

Incorrect SPU before fix:

apiVersion: cns.vmware.com/v1alpha2
kind: StoragePolicyUsage
metadata:
  creationTimestamp: "2024-09-26T22:53:32Z"
  generation: 1
  name: vsan-default-storage-policy-snapshot-usage
  namespace: ns1
  resourceVersion: "3726520"
  uid: 66340f39-712b-4300-869d-cae84edbb5f4
spec:
  resourceApiGroup: snapshot.storage.k8s.io
  resourceExtensionName: snapshot.cns.vsphere.vmware.com
  resourceKind: VolumeSnapshot
  storageClassName: vsan-default-storage-policy
  storagePolicyId: aa6d5a82-1c88-45da-85d3-3d74b91a5bad
status:
  quotaUsage:
    reserved: "0"
    used: "0"

Corrected SPU after fix:

apiVersion: cns.vmware.com/v1alpha2
kind: StoragePolicyUsage
metadata:
  creationTimestamp: "2024-09-26T22:53:32Z"
  generation: 1
  name: vsan-default-storage-policy-snapshot-usage
  namespace: ns1
  resourceVersion: "3752696"
  uid: 66340f39-712b-4300-869d-cae84edbb5f4
spec:
  resourceApiGroup: snapshot.storage.k8s.io
  resourceExtensionName: snapshot.cns.vsphere.vmware.com
  resourceKind: VolumeSnapshot
  storageClassName: vsan-default-storage-policy
  storagePolicyId: aa6d5a82-1c88-45da-85d3-3d74b91a5bad
status:
  quotaUsage:
    reserved: "0"
    used: 8172Mi

Verified that next cycle of full sync detects that the usage is correct

{"level":"info","time":"2024-09-30T20:39:16.709799502Z","caller":"syncer/fullsync.go:847","msg":"Number of volumes with synced aggregated snapshot size with CNS 1024","TraceId":"92e8af49-9cd0-48ca-a95c-6cee81cac6b6"}
{"level":"info","time":"2024-09-30T20:39:16.709836211Z","caller":"syncer/fullsync.go:848","msg":"Number of volumes with out-of-sync aggregated snapshot size with CNS 0","TraceId":"92e8af49-9cd0-48ca-a95c-6cee81cac6b6"}
{

Special notes for your reviewer:

Release note:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 30, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepakkinni

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants