Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add dns record delete validation #221

Closed
wants to merge 2 commits into from

Conversation

mikenairn
Copy link
Member

closes #181

fix e2e multi record test issues

  • Fix an issue with the domain owners checks when multiple instances are being used, allOwners was being appended with multiple duplicates.
  • Add additional checks to ensure all remaining records are eventually updated with the correct domain owners.
  • Add logging of time taken around dnsrecord create/update/delete checks.

feat: Add dns record delete validation

Adds validation around the deletion of DNSRecord resources to reduce the chances of unexpected records being left in the external dns provider (AWS Routes53 etc..) after the record has been garbage collected. It now takes into account the possibly that multiple records could be manipulating the same zone record set simultaneously which can happen if
a significant number of DNSRecord resources were contributing to the same rootHost record set in the provider and were all deleted at roughly the same time.

The deletion logic was changed to always run validation checks, at random intervals, for an amount of time (currently 15 seconds) after the last time a change was detected. A change here can either be that it had to update the provider itself or that some status on the record was changed, the most important one being Status.DomainOwners. If either of these changes occur the delete validation loop of 15 seconds is restarted and will only move on to remove the finalizer once it reaches the 15 seconds duration without any changes having been made.

Note: Some providers, such as google, handle this better as they require the API requests to know the current state of the record you are updating/deleting before it will be accepted. Others such as AWS(Route53) unfortunately do not meaning that some things one record just deleted, may get added back by another depending on when it last queried the providers zone records.

Validation

Terminal 1:
Setup local env with multiple instances and watch all dnsrecords:

make local-setup DEPLOY=true DEPLOYMENT_SCOPE=namespace DEPLOYMENT_COUNT=20
kubectl get dnsrecord -w -o json -A | jq '.metadata.namespace + ":" + .metadata.name + " = deletedAt: " + (.metadata.deletionTimestamp | tostring) + " queuedAt: " + (.status.queuedAt | tostring) + " domainOwners: " + (.status.domainOwners | tostring)'

Terminal 2:
Tail all operator logs:

kubectl stern -l control-plane=dns-operator-controller-manager --all-namespaces

Terminal 3:
Run multi record e2e with 20 instances (AWS)

make test-e2e-multi TEST_DNS_ZONE_DOMAIN_NAME=<your aws domain> TEST_DNS_PROVIDER_SECRET_NAME=dns-provider-credentials-aws TEST_DNS_NAMESPACES=dns-operator DEPLOYMENT_COUNT=20

Run multi record e2e with 20 instances (GCP)

make test-e2e-multi TEST_DNS_ZONE_DOMAIN_NAME=<your google domain> TEST_DNS_PROVIDER_SECRET_NAME=dns-provider-credentials-gcp TEST_DNS_NAMESPACES=dns-operator DEPLOYMENT_COUNT=20

@mikenairn mikenairn changed the title feat: Add dns record delete validation WIP feat: Add dns record delete validation Sep 6, 2024
* Fix an issue with the doamin owners checks when multiple instances are
  being used, allOwners was being appended with multiple duplicates.
* Add additional checks to ensure all remaining records are eventually
  updated with the correct domain owners.
* Add logging of time taken around dnsrecord create/update/delete
  checks.

Signed-off-by: Michael Nairn <[email protected]>
// QueuedAt is a time when DNS record was received for the reconciliation
// QueuedAt is a time when DNS record was received for the reconciliation.
// During deletion QueuedAt is used to track the last time a change was detected when processing the delete request.
// Used for validation purposes to ensure the record is completely removed from the provider.
QueuedAt metav1.Time `json:"queuedAt,omitempty"`

// QueuedFor is a time when we expect a DNS record to be reconciled again
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think QueuedFor is used anywhere, should it be removed(in a different PR)?

// QueuedAt is a time when DNS record was received for the reconciliation
// QueuedAt is a time when DNS record was received for the reconciliation.
// During deletion QueuedAt is used to track the last time a change was detected when processing the delete request.
// Used for validation purposes to ensure the record is completely removed from the provider.
QueuedAt metav1.Time `json:"queuedAt,omitempty"`
Copy link
Member Author

@mikenairn mikenairn Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure about re-using this for a different purpose, but it didn't seem like a great idea adding yet another time field just for tracking the delete updates.

@mikenairn mikenairn changed the title WIP feat: Add dns record delete validation feat: Add dns record delete validation Sep 10, 2024
Adds validation around the deletion of DNSRecord resources to reduce the
chances of unexpected records being left in the external dns provider
(AWS Routes53 etc..) after the record has been garbage collected. It now
takes into account the possibly that multiple records could be
manipulating the same zone record set simultaneously which can happen if
a significant number of DNSRecord resources were contributing to the
same rootHost record set in the provider and were all deleted at roughly
the same time.

The deletion logic was changed to always queue for another
reconciliation loop an amount of time (currently 15 seconds) after the
last time a change was detected. A change can either be that it had
to update the provider itself or that some status on the record was
changed, the most important one being `Status.DomainOwners`. If either
of these changes occur the delete validation loop of 15 seconds is
restarted and will only move on to remove the finalizer once it reaches
the 15 seconds duration without any changes having been made.

Note: Some providers, such as google, handle this better as they require
the API requests to know the current state of the record you are
updating/deleting before it will be accepted. Others such as
AWS(Route53) unfortunately do not meaning that some things one record
just deleted, may get added back by another depending on when it last
queried the providers zone records.

Signed-off-by: Michael Nairn <[email protected]>
@mikenairn mikenairn marked this pull request as ready for review September 10, 2024 10:40
@mikenairn mikenairn closed this Sep 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve DNSRecord deletion to better ensure changes are persisted
1 participant