Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes the requeueing bug which requeues a reconcile request in a wrong workqueue #259

Merged
merged 1 commit into from
Aug 18, 2023

Conversation

ulucinar
Copy link
Collaborator

@ulucinar ulucinar commented Aug 18, 2023

Description of your changes

During some load tests with upbound/[email protected], we've spotted a bug where a reconcile request might be requeued in a wrong workqueue. The bug manifests itself in the provider debug logs with messages similar to the following:

2023-08-10T14:47:56Z    DEBUG   provider-aws    Reconciling     {"controller": "managed/iam.aws.upbound.io/v1beta1, kind=rolepolicyattachment", "request": {"name":"rachel-sweeney-auth-service-default-policy-h5ck6"}}
2023-08-10T14:47:56Z    DEBUG   provider-aws    Cannot get managed resource     {"controller": "managed/iam.aws.upbound.io/v1beta1, kind=rolepolicyattachment", "request": {"name":"rachel-sweeney-auth-service-default-policy-h5ck6"}, "error": "RolePolicyAttachment.iam.aws.upbound.io \"rachel-sweeney-auth-service-default-policy-h5ck6\" not found"}
2023-08-10T14:48:10Z    DEBUG   provider-aws    Reconciling     {"controller": "managed/iam.aws.upbound.io/v1beta1, kind=policy", "request": {"name":"rachel-sweeney-auth-service-default-policy-h5ck6"}}
2023-08-10T14:48:10Z    DEBUG   provider-aws    External resource is up to date {"controller": "managed/iam.aws.upbound.io/v1beta1, kind=policy", "request": {"name":"rachel-sweeney-auth-service-default-policy-h5ck6"}, "uid": "7f8b9e51-34c4-475e-8653-53cb2adb364d", "version": "51370046", "external-name": "rachel-sweeney-auth-service-default-policy-h5ck6", "requeue-after": "2023-08-10T14:49:10Z"}
2023-08-10T14:50:04Z    DEBUG   provider-aws    Reconciling     {"controller": "managed/iam.aws.upbound.io/v1beta1, kind=policy", "request": {"name":"rachel-sweeney-auth-service-default-policy-h5ck6"}}
2023-08-10T14:50:04Z    DEBUG   provider-aws    External resource is up to date {"controller": "managed/iam.aws.upbound.io/v1beta1, kind=policy", "request": {"name":"rachel-sweeney-auth-service-default-policy-h5ck6"}, "uid": "7f8b9e51-34c4-475e-8653-53cb2adb364d", "version": "51370046", "external-name": "rachel-sweeney-auth-service-default-policy-h5ck6", "requeue-after": "2023-08-10T14:51:04Z"}

The issue here is that the reconciler is attempting to observe a resource using a wrong kind and this results in a not-found error. The root cause of the issue is that we are unintentionally sharing the event handlers across all the reconcilers in a given API group. The fix is to have separate event handlers per GVK.

This bug causes the reconciler to do unnecessary work. Reconciliation opportunities are lost for a given (GVK,name) pair because another pair is observed. This bug does not affect the periodic reconciles done at each poll period and hence an affected resource is eventually successfully reconciled. But successful reconciliation needs more attempts. This results in, for instance, increased TTR, and similar.

This PR also improves some debug logging that proved to be useful in interpreting the results we observed in the load tests. We are also planning to expose some further custom metrics reporting these measurements to improve monitoring.

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

Performed load tests with 3000 MRs. Based on just one observation, initially it took ~2h for a resource affected by this bug to transition to the ready state under heavy load (99% total CPU (over all available CPU) utilization). With the fix, it took ~1857 s on average for 23 tries (experiments) with a single resource on top of 3000 MRs.

@ulucinar ulucinar requested a review from sergenyalcin as a code owner August 18, 2023 08:53
@ulucinar ulucinar requested review from sergenyalcin and removed request for sergenyalcin August 18, 2023 08:53
@@ -35,11 +36,12 @@ func Setup(mgr ctrl.Manager, o tjcontroller.Options) error {
if o.SecretStoreConfigGVK != nil {
cps = append(cps, connection.NewDetailsManager(mgr.GetClient(), *o.SecretStoreConfigGVK, connection.WithTLSConfig(o.ESSOptions.TLSConfig)))
}
eventHandler := handler.NewEventHandler(handler.WithLogger(o.Logger.WithValues("gvk", {{ .TypePackageAlias }}{{ .CRD.Kind }}_GroupVersionKind)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am writing this comment to make sure that what I understand is correct.

As far as I understand, before, we were using a universal event handler for each resource controller that belonged to the provider. With this PR, we allow each resource to use its event handler. This solves the error in the PR description.

Copy link
Collaborator Author

@ulucinar ulucinar Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, correct. Previously, the event handler was unintentionally per API group (an API group was sharing an event handler) or per-provider. We are now fixing this so that an event handler is per GVK.

Copy link
Member

@sergenyalcin sergenyalcin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ulucinar LGTM!

@ulucinar ulucinar merged commit bd528e4 into crossplane:main Aug 18, 2023
@ulucinar ulucinar deleted the fix-requeue branch August 18, 2023 11:52
@csantanapr
Copy link

@ulucinar Could share the tests to do load tests?
What kind of MR do you create? Real cloud resource ?

@jeanduplessis
Copy link
Collaborator

@csantanapr we use the tool here: https://github.com/upbound/uptest/tree/main/cmd/perf

As we move Upjet and all the related providers and tooling upstream, we will also document them. No documentation exists currently for this tool, however, here's an example of how to call it:

go run main.go --mrs ./cmd/perf/manifests/ecr.yaml=<number-of-mrs-to-provision> --provider-pod <pod-name> --provider-namespace upbound-system --node <node-ip> --step-duration 1s

# For 10 MRs:
go run main.go --mrs ./cmd/perf/manifests/ecr.yaml=10 --provider-pod <pod-name> --provider-namespace upbound-system --node <node-ip> --step-duration 1s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants