Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Controller doesn't requeue ScheduledQueryRules #4293

Open
andreipantelimon opened this issue Sep 24, 2024 · 3 comments
Open

Bug: Controller doesn't requeue ScheduledQueryRules #4293

andreipantelimon opened this issue Sep 24, 2024 · 3 comments
Labels
bug 🪲 Something isn't working needs-triage 🔍 waiting-on-user-response Waiting on more information from the original user before progressing.

Comments

@andreipantelimon
Copy link

andreipantelimon commented Sep 24, 2024

Version of Azure Service Operator
v2.8.0

Describe the bug
Our use-case is to deploy an Azure AKS scale-unit, consisting of multiple resources, including:

  • AKS cluster
  • Log Analytics Workspace, deployed via terraform
  • ScheduledQueryRule, which ideally will be deployed by ASO.
    When we create a new scale-unit, we do introduce a direct dependency for LAW to be created before the ScheduledQueryRule.
    The first time ASO tries to create the resource it will fail, because it's waiting for the RG object to be created, but afterwards, it should fail and retry until the LAW object is created (eventual consistency).
    But after the first failure of LAW missing, it won't requeue the object. We think that the Bad Request response got from Azure is not handled properly and goes into a permanent failure.

To Reproduce
Steps to reproduce the behavior:

  • Deploy a ScheduledQueryRule, this should fail for LAW missing (expected)
  • Deploy the missing LAW and reference it in the ScheduledQueryRule by ArmId
  • The ScheduledQueryRule should reconcile now that the LAW is available, but it doesn't get requeued

Expected behavior
The controller should be able to retry missing resource requests, coded as Bad Requests.

Screenshots
These are the logs that we saw in the controller, but the object will not be reconciled.

I0918 13:38:25.265321       1 generic_reconciler.go:101] "msg"="Encountered error, re-queuing..." "logger"="controllers.ScheduledQueryRuleController" "name"="aso-flux-alerts" "namespace"="ns" "result"={"Requeue":false,"RequeueAfter":30000000000}
I0918 13:39:56.908898       1 generic_reconciler.go:101] "msg"="Encountered error, re-queuing..." "logger"="controllers.ScheduledQueryRuleController" "name"="aso-flux-alerts" "namespace"="ns" "result"={"Requeue":false,"RequeueAfter":0}

Additional context
We are also using CAPZ along with ASO, and we also have these kinds of resources there, which reference LAW via ArmId, and they will be eventually reconciled when the LAW is created. This should behave the same.

@andreipantelimon andreipantelimon added the bug 🪲 Something isn't working label Sep 24, 2024
@theunrepentantgeek
Copy link
Member

When ASO encounters a BadRequest response from ARM, it classifies it as either a fatal (not worth retrying because it will never work) or non-fatal (schedule for a retry later).

I expect that what's happening here is that the error needs to be reclassified as non-fatal so that our existing retry mechanisms will kick in. Fortunately, we have an extension mechanism (the interface extensions.ErrorClassifier) for exactly this.

To do this, we need the entire error that's returned from Azure - this should be prominent in your logs, just prior to the logs you quoted above.

@theunrepentantgeek theunrepentantgeek added the waiting-on-user-response Waiting on more information from the original user before progressing. label Sep 26, 2024
@andreipantelimon
Copy link
Author

Ok, let me rerun the tests once more to get accurate data, will come back with the errors

@andreipantelimon
Copy link
Author

andreipantelimon commented Sep 27, 2024

So, this is the sequence of the logs that we get:

I0927 13:33:16.553877       1 common.go:66] "msg"="Reconcile invoked" "annotations"={"serviceoperator.azure.com/credential-from":"aso-secret","serviceoperator.azure.com/operator-namespace":"capz-system","serviceoperator.azure.com/reconcile-policy":"manage"} "conditions"="[]" "creationTimestamp"="2024-09-27T13:33:16Z" "deletionTimestamp"=null "finalizers"=null "generation"=1 "kind"={"kind":"ScheduledQueryRule","apiVersion":"insights.azure.com/v1api20220615storage"} "logger"="controllers.ScheduledQueryRuleController" "name"="aso-flux-alerts" "namespace"="target-ns" "owner"={"group":"resources.azure.com","kind":"ResourceGroup","name":"target-rg"} "ownerReferences"=null "resourceVersion"="2199408997" "uid"="uid2"

Here, it's waiting a bit for the Owner RG to be created, 1-2 mins, then sends the resource to Azure.

I0927 13:34:17.890325       1 azure_generic_arm_reconciler_instance.go:316] "msg"="About to send resource to Azure" "azureName"="aso-flux-alerts" "logger"="controllers.ScheduledQueryRuleController" "name"="aso-flux-alerts" "namespace"="target-ns"
E0927 13:34:18.875269       1 generic_reconciler.go:388] "msg"="Encountered error impacting Ready condition" "error"="Reason: BadRequest, Severity: Error, RetryClassification: RetrySlow, Cause: Scope 'target-law' does not exist Activity ID: activity-id.: PUT https://management.azure.com/subscriptions/subscription-id/resourceGroups/target-rg/providers/Microsoft.Insights/scheduledQueryRules/aso-flux-alerts\n--------------------------------------------------------------------------------\nRESPONSE 400: 400 Bad Request\nERROR CODE: BadRequest\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"BadRequest\",\n    \"message\": \"Scope 'target-law' does not exist Activity ID: activity-id.\"\n  }\n}\n--------------------------------------------------------------------------------\n" "logger"="controllers.ScheduledQueryRuleController" "name"="aso-flux-alerts" "namespace"="target-ns"
I0927 13:34:18.875294       1 recorder.go:104] "msg"="Reason: BadRequest, Severity: Error, RetryClassification: RetrySlow, Cause: Scope 'target-law' does not exist Activity ID: activity-id.: PUT https://management.azure.com/subscriptions/subscription-id/resourceGroups/target-rg/providers/Microsoft.Insights/scheduledQueryRules/aso-flux-alerts\n--------------------------------------------------------------------------------\nRESPONSE 400: 400 Bad Request\nERROR CODE: BadRequest\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"BadRequest\",\n    \"message\": \"Scope 'target-law' does not exist Activity ID: activity-id.\"\n  }\n}\n--------------------------------------------------------------------------------\n" "logger"="events" "object"={"kind":"ScheduledQueryRule","namespace":"target-ns","name":"aso-flux-alerts","uid":"uid2","apiVersion":"insights.azure.com/v1api20220615storage","resourceVersion":"2199410161"} "reason"="CreateOrUpdateActionError" "type"="Warning"
I0927 13:34:18.934244       1 common.go:66] "msg"="updated resource in etcd" "annotations"={"serviceoperator.azure.com/credential-from":"aso-secret","serviceoperator.azure.com/latest-reconciled-generation":"1","serviceoperator.azure.com/operator-namespace":"capz-system","serviceoperator.azure.com/reconcile-policy":"manage","serviceoperator.azure.com/resource-id":"/subscriptions/subscription-id/resourceGroups/target-rg/providers/Microsoft.Insights/scheduledQueryRules/aso-flux-alerts"} "conditions"="[Condition [Ready], Status = \"False\", ObservedGeneration = 1, Severity = \"Error\", Reason = \"BadRequest\", Message = \"Scope 'target-law' does not exist Activity ID: activity-id.: PUT https://management.azure.com/subscriptions/subscription-id/resourceGroups/target-rg/providers/Microsoft.Insights/scheduledQueryRules/aso-flux-alerts\\n--------------------------------------------------------------------------------\\nRESPONSE 400: 400 Bad Request\\nERROR CODE: BadRequest\\n--------------------------------------------------------------------------------\\n{\\n  \\\"error\\\": {\\n    \\\"code\\\": \\\"BadRequest\\\",\\n    \\\"message\\\": \\\"Scope 'target-law' does not exist Activity ID: activity-id.\\\"\\n  }\\n}\\n--------------------------------------------------------------------------------\\n\", LastTransitionTime = \"2024-09-27 13:34:18 +0000 UTC\"]" "creationTimestamp"="2024-09-27T13:33:16Z" "deletionTimestamp"=null "finalizers"=["serviceoperator.azure.com/finalizer"] "generation"=1 "kind"={"kind":"ScheduledQueryRule","apiVersion":"insights.azure.com/v1api20220615storage"} "logger"="controllers.ScheduledQueryRuleController" "name"="aso-flux-alerts" "namespace"="target-ns" "owner"={"group":"resources.azure.com","kind":"ResourceGroup","name":"target-rg"} "ownerReferences"=[{"apiVersion":"resources.azure.com/v1api20200601storage","kind":"ResourceGroup","name":"target-rg","uid":"uid1"}] "resourceVersion"="2199410186" "uid"="uid2"
I0927 13:34:18.934324       1 generic_reconciler.go:101] "msg"="Encountered error, re-queuing..." "logger"="controllers.ScheduledQueryRuleController" "name"="aso-flux-alerts" "namespace"="target-ns" "result"={"Requeue": false,"RequeueAfter": 0}

A couple of minutes after, 2-3 mins max, the LAW is created, but this resource creation is never retried. Even after a couple of hours, no more logs regarding this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🪲 Something isn't working needs-triage 🔍 waiting-on-user-response Waiting on more information from the original user before progressing.
Projects
Development

No branches or pull requests

2 participants