Support re-assignment of another failure domain when the machine failed to provision #353
Conversation
Welcome @jhaanvi5!
Hi @jhaanvi5. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
✅ Deploy Preview for kubernetes-sigs-cluster-api-cloudstack ready!
/cc @vignesh-goutham
/uncc @chrisdoherty4
infrav1 "sigs.k8s.io/cluster-api-provider-cloudstack/api/v1beta3" | ||
"sigs.k8s.io/cluster-api-provider-cloudstack/controllers/utils" | ||
"sigs.k8s.io/cluster-api-provider-cloudstack/pkg/cloud" | ||
cserrors "sigs.k8s.io/cluster-api-provider-cloudstack/pkg/errors" | ||
"sigs.k8s.io/cluster-api-provider-cloudstack/pkg/failuredomains" |
These should be in it's own section below right?
pkg/failuredomains/client.go (outdated)
clientConfig := &corev1.ConfigMap{} | ||
key = client.ObjectKey{Name: cloud.ClientConfigMapName, Namespace: cloud.ClientConfigMapNamespace} | ||
_ = f.Get(ctx, key, clientConfig) |
Add a comment here mentioning that client config is optional, hence we can ignore the error
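A rough sketch of how that comment could read on the snippet above; it reuses the surrounding variables from the diff (f, ctx, key) and the wording of the reasoning is an assumption to be confirmed by the author:

// The client config map is optional; when it is missing or unreadable the
// controller simply proceeds without extra client configuration, so the Get
// error is ignored on purpose.
clientConfig := &corev1.ConfigMap{}
key = client.ObjectKey{Name: cloud.ClientConfigMapName, Namespace: cloud.ClientConfigMapNamespace}
_ = f.Get(ctx, key, clientConfig)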
@@ -26,6 +26,8 @@ import (
// The presence of a finalizer prevents CAPI from deleting the corresponding CAPI data.
const MachineFinalizer = "cloudstackmachine.infrastructure.cluster.x-k8s.io"

const MachineCreateFailAnnotation = "cluster.x-k8s.io/vm-create-failed"
Suggested change:
const MachineCreateFailAnnotation = "cluster.x-k8s.io/vm-create-failed"
const MachineCreateFailedAnnotation = "cluster.x-k8s.io/vm-create-failed"
/ok-to-test
/uncc chrisdoherty4
r.Log.Info("Marking machine as failed to launch", "csMachine", r.ReconciliationSubject.GetName()) | ||
r.ReconciliationSubject.MarkAsFailed() | ||
|
||
if _, err := r.ReconcileDelete(); err != nil { |
I don't follow this part. Why ReconcileDelete when the machine is marked as failed?
@@ -78,6 +83,26 @@ func (c *client) ResolveNetwork(net *infrav1.Network) (retErr error) {
	return nil
}

// GetPublicIPs gets public IP addresses for the associated failure domain network
why failure domain network? this code doesn't seem to have anything special about failure domain, right? Wouldn't this work for any network?
csErrorCodeRegexp, _ = regexp.Compile(".+CSExceptionErrorCode: ([0-9]+).+")

// List of error codes: https://docs.cloudstack.apache.org/en/latest/developersguide/dev.html#error-handling
csTerminalErrorCodes = strings.Split(getEnv("CLOUDSTACK_TERMINAL_FAILURE_CODES", "4250,9999"), ",")
what's the reason to make this configurable?
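For context, a minimal sketch of how these two package-level variables might be combined to classify a CloudStack error as terminal; isTerminalCSError is a hypothetical helper name, not necessarily what the PR uses:

// isTerminalCSError reports whether err carries a CSExceptionErrorCode that is
// configured as terminal (non-retryable), e.g. 4250 or 9999 by default.
func isTerminalCSError(err error) bool {
	if err == nil {
		return false
	}
	// Pull the numeric error code out of the CloudStack error message.
	m := csErrorCodeRegexp.FindStringSubmatch(err.Error())
	if len(m) != 2 {
		return false
	}
	for _, code := range csTerminalErrorCodes {
		if strings.TrimSpace(code) == m[1] {
			return true
		}
	}
	return false
}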
@@ -0,0 +1,62 @@
/*
Copyright 2022 The Kubernetes Authors.
Suggested change:
Copyright 2022 The Kubernetes Authors.
Copyright 2024 The Kubernetes Authors.
🤷🏻♂️
)

type Balancer interface {
	Assign(ctx context.Context, csMachine *infrav1.CloudStackMachine, capiMachine *clusterv1.Machine, fds []infrav1.CloudStackFailureDomainSpec) error
What does it entail to assign a failure domain? Only setting the field in the machines?
In that case I would vote to rewrite this to return the actual failure domain, and let the caller set the field in the machine. This is just a personal opinion, but I think that minimizing side effects makes things easier to understand. In this case, it separates the logic to decide a failure domain from the code to set the failure domain.
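A minimal sketch of the shape being suggested, with Pick as a hypothetical method name and the infrav1/clusterv1 types taken from the interface above:

// A Balancer only decides; it does not mutate the machines.
type Balancer interface {
	// Pick returns the name of the failure domain the machine should be placed in.
	Pick(ctx context.Context, csMachine *infrav1.CloudStackMachine, capiMachine *clusterv1.Machine, fds []infrav1.CloudStackFailureDomainSpec) (string, error)
}

// The caller then applies the decision itself:
//
//	name, err := balancer.Pick(ctx, csMachine, capiMachine, fds)
//	if err != nil {
//		return err
//	}
//	csMachine.Spec.FailureDomainName = name
//	capiMachine.Spec.FailureDomain = pointer.String(name)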
return newReassigningFailureDomainBalancer(newFallingBackFailureDomainBalancer(
	newFreeIPValidatingFailureDomainBalancer(csClientFactory),
	newRandomFailureDomainBalancer(),
))
I appreciate the idea of using composition here. However, what's the benefit in this case?
It doesn't seem like we are reusing any of the Balancers. Do we expect to have to use them in a different way at some point? Can we actually use them separately? In other words, are they truly independent?
If it's just to decompose the problem into smaller parts, does that make the program simpler or more complex?
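For reference, judging by the constructor above, the falling-back balancer is presumably a thin decorator along these lines (names and signature assumed from the Balancer interface shown earlier, not taken from the PR):

// fallingBackFailureDomainBalancer tries the primary balancer first and only
// consults the fallback when the primary cannot assign a failure domain.
type fallingBackFailureDomainBalancer struct {
	primary, fallback Balancer
}

func (b *fallingBackFailureDomainBalancer) Assign(ctx context.Context, csMachine *infrav1.CloudStackMachine, capiMachine *clusterv1.Machine, fds []infrav1.CloudStackFailureDomainSpec) error {
	// Prefer the primary strategy (e.g. most free IPs) and fall back to the
	// secondary one (e.g. random) if it fails.
	if err := b.primary.Assign(ctx, csMachine, capiMachine, fds); err == nil {
		return nil
	}
	return b.fallback.Assign(ctx, csMachine, capiMachine, fds)
}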
if capiMachineHasFailureDomain(capiMachine) && !csMachine.HasFailed() {
	csMachine.Spec.FailureDomainName = *capiMachine.Spec.FailureDomain
	assignFailureDomainLabel(csMachine, capiMachine)
} else if err := r.delegate.Assign(ctx, csMachine, capiMachine, fds); err != nil {
	return err
}

if capiMachineHasFailureDomain(capiMachine) && csMachine.HasFailed() {
	capiMachine.Spec.FailureDomain = pointer.String(csMachine.Spec.FailureDomainName)
}
I'll admit I'm not an expert on this area of the code base, but the intent here is not clear to me. If this is the best way to write this logic, I think it could benefit from some comments explaining what it's supposed to be doing (at a higher level than the code).
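In case it helps, here is one possible reading of that branch, written out with the kind of comments being asked for; the interpretation of intent is an assumption, not the author's confirmed design:

// Normal case: CAPI has already picked a failure domain and the CloudStack
// machine has not failed to provision there, so simply mirror CAPI's choice.
if capiMachineHasFailureDomain(capiMachine) && !csMachine.HasFailed() {
	csMachine.Spec.FailureDomainName = *capiMachine.Spec.FailureDomain
	assignFailureDomainLabel(csMachine, capiMachine)
} else if err := r.delegate.Assign(ctx, csMachine, capiMachine, fds); err != nil {
	// Otherwise (no failure domain decided yet, or the previous VM create
	// failed) let the delegate balancer choose a failure domain.
	return err
}

// If the machine had failed and the delegate chose a new failure domain,
// propagate that choice back to the CAPI machine so both objects agree.
if capiMachineHasFailureDomain(capiMachine) && csMachine.HasFailed() {
	capiMachine.Spec.FailureDomain = pointer.String(csMachine.Spec.FailureDomainName)
}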
))
}

type reassigningFailureDomainBalancer struct {
I think all these implementations of Balancer could use some Go docs explaining what they do.
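For example, doc comments along these lines; the one-line descriptions are inferred only from the constructor names, so treat them as assumptions for the author to correct:

// reassigningFailureDomainBalancer re-assigns a machine to a different failure
// domain when its previous VM create attempt failed, delegating the actual
// choice to the wrapped Balancer.
type reassigningFailureDomainBalancer struct{ /* ... */ }

// freeIPValidatingFailureDomainBalancer prefers failure domains whose network
// still has free public IP addresses available.
type freeIPValidatingFailureDomainBalancer struct{ /* ... */ }

// randomFailureDomainBalancer picks one of the configured failure domains at
// random.
type randomFailureDomainBalancer struct{}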
@jhaanvi5 What is the status/ETA for this PR?
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: jhaanvi5, vivek-koppuru. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@vishesh92 Figuring this out. I think this is done for the most part, so it would be great to get your review as well. This came as an ask, so we need to follow up on whether any of these changes can't make it into the v0.5.0 release, but we would love to have it included pending review from you all!
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@vivek-koppuru I reviewed this PR a little bit and have the same questions as @g-gaston. Could you please resolve/address the comments? This PR also has some conflicts now.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Issue #352
Description of changes: CAPC chooses a random failure domain when deploying worker machines. When the VM deploy fails, irrespective of the type of error, CAPC keeps re-attempting to deploy a VM until CAPI replaces the owner machine. This change introduces a failure domain balancer with a fallback mechanism: the primary balancer selects the failure domain whose network has the most free IPs, and if that fails it falls back to random failure domain selection.
Testing performed:
make test
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.