Machine with error ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS is not flagged as user error #766

Open
AleksandarSavchev opened this issue May 21, 2024 · 2 comments
Labels
area/ops-productivity Operator productivity related (how to improve operations) kind/bug Bug platform/gcp Google cloud platform/infrastructure

Comments

@AleksandarSavchev
Member

How to categorize this issue?

/area ops-productivity
/kind bug
/platform gcp

What happened:
Together with @ialidzhikov we found that a machine error such as

status:
  currentStatus:
    lastUpdateTime: "2024-05-21T12:57:28Z"
    phase: CrashLoopBackOff
  lastOperation:
    description: 'Cloud provider message - machine codes error: code = [ResourceExhausted]
      message = [Create machine "machine"
      failed: The zone ''zone'' does not have
      enough resources available to fulfill the request.  ''(resource type:compute)''.]'
    errorCode: ResourceExhausted
    lastUpdateTime: "2024-05-21T12:57:28Z"
    state: Failed
    type: Create

is not properly categorised as a user error, although it is supposed to be matched by

quotaExceededRegexp = regexp.MustCompile(`(?i)((?:^|[^t]|(?:[^s]|^)t|(?:[^e]|^)st|(?:[^u]|^)est|(?:[^q]|^)uest|(?:[^e]|^)quest|(?:[^r]|^)equest)LimitExceeded|Quotas|Quota.*exceeded|exceeded quota|Quota has been met|QUOTA_EXCEEDED|ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS)`)

However, ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS is replaced by ResourceExhausted here, so the regular expression no longer matches the resulting message:
https://github.com/gardener/machine-controller-manager-provider-gcp/blob/295ac09467c51746f87762130a25934be202df68/pkg/gcp/machine_controller_util.go#L440-L446
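The mismatch can be checked with a small Go program. This is only a sketch: the regular expression and the description are copied from the text above (the description joined into a single line), while `hypotheticalRawMessage` is an invented string that still carries the original GCP code and is not taken from any real log.

```go
package main

import (
	"fmt"
	"regexp"
)

// quotaExceededRegexp is copied verbatim from the issue text.
var quotaExceededRegexp = regexp.MustCompile(`(?i)((?:^|[^t]|(?:[^s]|^)t|(?:[^e]|^)st|(?:[^u]|^)est|(?:[^q]|^)uest|(?:[^e]|^)quest|(?:[^r]|^)equest)LimitExceeded|Quotas|Quota.*exceeded|exceeded quota|Quota has been met|QUOTA_EXCEEDED|ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS)`)

// reportedDescription is the lastOperation description from the Machine
// status shown in the issue, joined into one line.
const reportedDescription = `Cloud provider message - machine codes error: code = [ResourceExhausted] message = [Create machine "machine" failed: The zone 'zone' does not have enough resources available to fulfill the request.  '(resource type:compute)'.]`

// hypotheticalRawMessage is an assumption for illustration only: a message
// that still contains the original GCP error code.
const hypotheticalRawMessage = `ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS: The zone 'zone' does not have enough resources available to fulfill the request.`

func main() {
	// The reported description contains only "ResourceExhausted".
	fmt.Println("reported description matches:", quotaExceededRegexp.MatchString(reportedDescription))
	// A message that keeps the GCP code would be caught by the last alternative.
	fmt.Println("message with GCP code matches:", quotaExceededRegexp.MatchString(hypotheticalRawMessage))
}
```

The first check reports no match and the second reports a match, which is consistent with the error not being flagged once the GCP code has been replaced.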

What you expected to happen:
Error to be properly flagged as user error.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Gardener version (if relevant):
  • Extension version: v1.35.0
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
@gardener-robot gardener-robot added area/ops-productivity Operator productivity related (how to improve operations) kind/bug Bug platform/gcp Google cloud platform/infrastructure labels May 21, 2024
@kon-angelo
Contributor

Do you think it would make sense, instead of fixing only this occurrence, to make the extension worker library aware of the MCM error codes and map them to Gardener error codes? WDYT @ialidzhikov?

@ialidzhikov
Member

ialidzhikov commented May 23, 2024

Sounds reasonable, at least for the machine-controller-manager error codes that we can map unambiguously: Unauthenticated, PermissionDenied, ResourceExhausted. We can even do it without a regex, as the corresponding error code is present in the Machine status (.status.errorCode field). I didn't check whether it is propagated to the MachineDeployment status.
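A regex-free mapping along these lines could look roughly like the sketch below. Everything here is an assumption for illustration: the helper `gardenerCodeFor`, the map, and the locally defined constants are hypothetical, and the target strings are meant to mirror Gardener's infrastructure error codes rather than import them. In particular, mapping PermissionDenied to an "unauthorized" code is a guess, not something decided in this thread.

```go
package main

import "fmt"

// Assumed Gardener error code values, defined locally so that the sketch
// stays self-contained. Verify them against the Gardener API before use.
const (
	errInfraUnauthenticated = "ERR_INFRA_UNAUTHENTICATED"
	errInfraUnauthorized    = "ERR_INFRA_UNAUTHORIZED"
	errInfraQuotaExceeded   = "ERR_INFRA_QUOTA_EXCEEDED"
)

// machineToGardenerCode covers only the three machine-controller-manager
// error codes named in the comment above. The pairing is an assumption.
var machineToGardenerCode = map[string]string{
	"Unauthenticated":   errInfraUnauthenticated,
	"PermissionDenied":  errInfraUnauthorized,
	"ResourceExhausted": errInfraQuotaExceeded,
}

// gardenerCodeFor (hypothetical helper) returns the mapped code and true if
// the machine error code is one of the unambiguous ones, or "" and false.
func gardenerCodeFor(machineErrorCode string) (string, bool) {
	code, ok := machineToGardenerCode[machineErrorCode]
	return code, ok
}

func main() {
	// "ResourceExhausted" is the errorCode shown in the Machine status above.
	if code, ok := gardenerCodeFor("ResourceExhausted"); ok {
		fmt.Println("mapped to:", code)
	}
	// Codes outside the map stay unmapped and would need another path.
	if _, ok := gardenerCodeFor("Internal"); !ok {
		fmt.Println("no unambiguous mapping for Internal")
	}
}
```

A plain lookup keeps the mapping explicit and avoids depending on the wording of provider messages. Codes that are not listed are simply left unmapped.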
