StatefulSet restarter always restarts replica 0 immediately after initial rollout #111
Actually, this is how the restart controller is supposed to work. The Superset operator creates a StatefulSet labeled with the restarter label. Even if the immediate restart is confusing, I do not see an elegant solution to prevent this, so I propose to close this issue. |
If the outcome is "this is how it is supposed to work", then it might not be a bug, but we should still go back to the drawing board and look at what we've built, because this immediate restart is really something that shouldn't happen. Maybe the title of the issue and also its place in this repo are not correct anymore. After talking to sigi, I'm moving it back to "Refinement: In Progress" |
Taking a step back. Outcomes from the refinement discussion:
Observed Situation
Investigation Worthy Threads
Possible Next Steps
|
@soenkeliebau @lfrancke It is unclear how to proceed here; some input would be great. Do we leave things as they are, or do we pursue one of the investigation threads? This ticket can be closed in favour of those other investigation threads. |
That the commons-operator is reconciling doesn't really say much; that will happen whenever the STS is modified externally too. If you want to check whether it has triggered a restart, then you'll want to check the STS YAMLs for whether the restarter annotations change. I'd also argue that the restart controller isn't the cause of the problem here, just a symptom. The restart controller will trigger when the app's config changes. Check why that happens if you want to avoid this. |
(Case in point, if the restart controller is at fault then this will happen for all product operators, not just Superset...) |
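For illustration of "check the STS YAMLs for whether the restarter annotations change": the excerpt below sketches where such annotations would sit on a restarter-enabled StatefulSet. The label and annotation keys/values are assumptions made for the sake of the example, not quoted from the operator.

```yaml
# Hypothetical excerpt of a restarter-enabled StatefulSet; the label and
# annotation keys/values are illustrative assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: superset-node-default
  labels:
    restarter.stackable.tech/enabled: "true"      # assumed opt-in label
spec:
  serviceName: superset-node-default
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: superset
  template:
    metadata:
      labels:
        app.kubernetes.io/name: superset
      annotations:
        # Assumed tracking annotation maintained by the restart controller.
        # If this value differs between two snapshots of the STS YAML,
        # the restarter has triggered a rolling restart.
        configmap.restarter.stackable.tech/superset-config: "rev-42"
    spec:
      containers:
        - name: superset
          image: example.com/superset:latest
```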
Thanks Teo.
I'm a bit confused now, though. This sounds like the answer that you were looking for, no? |
The restart controller just watches the configmaps for changes, and triggers STS restarts based on that. The restart controller is just a messenger for the actual issue (that the CM changed). It never modifies the CMs itself. |
Understood, thank you. @siegfriedweber @fhennig @vsupalov - sorry, back to you for now? |
The restarter annotations are initially added to the STS YAMLs by the restart controller. This triggers the restart.
The restart controller also triggers when a new StatefulSet with the annotation is created.
It happens for all StatefulSets which are annotated with it. I think that I accurately described the issue in my first comment:
Please tell me if I am wrong. |
@siegfriedweber and @teozkr will have a chat about this |
Hm, the label "should" still be added quickly enough that the pods shouldn't have been scheduled yet, but maybe my memory fails me on how much of a window we have; I will have to do some testing tomorrow... If that is indeed the sad case, then a fail-open mutating webhook probably makes sense at some point (since we'd still need the controller as well, the worst that could happen would be falling back to the current extra restart), and it should be completely transparent to the operators. The downside is that doing a first webhook means doing a bunch of new scaffolding from scratch (certificates, kube API registration, etc.)... |
Ok, having tried it out now it does look like the first replica is restarted while all other replicas are created late enough that they are up to date. Avoiding that glitch will probably require us to do the webhook... |
I have validated that a mutating webhook is able to avoid the initial restart for a simple STS: 0cbf72c
Moving this ticket to commons since this is a bug in the restarter, not in superset-op specifically. |
So, is the mutating webhook the accepted solution to this problem? I see that this ticket is in refinement acceptance; I'd like to know the up- and downsides of this solution. Is there another way? How bad is:
I personally wouldn't call this ticket refined, as I wouldn't know what to do. The ticket doesn't have tasks or even acceptance criteria. |
I agree, this came up during the acceptance. So I'll move it back for now. |
A few questions:
|
Yes, good old CAP strikes again. There's no way that we could possibly hook into an API in a way that is both:
We could technically sacrifice any of these, but (to me) 2 and 3 are both far more vital than 1.
The operator and installer both need permission to work with the MWC (see the illustrative RBAC sketch below). This design would allow users who are uncomfortable with running our webhook to simply not deploy the MWC, at the cost of still having to deal with this extra restart.
I haven't tested the spike against OS specifically, but Google seems to believe so. I also remember noticing that certain parts of OS itself are implemented using webhooks.
It might be worth bringing up at tomorrow's arch meeting. I'll add it to the list. |
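As a minimal sketch of the permission mentioned above (the role name and the exact verb list are assumptions, not taken from the actual installer), the operator's ClusterRole could be extended roughly like this:

```yaml
# Illustrative ClusterRole fragment; the name and verb list are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: commons-operator-webhook     # hypothetical role name
rules:
  - apiGroups: ["admissionregistration.k8s.io"]
    resources: ["mutatingwebhookconfigurations"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```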
That's fine for me, as the worst that can happen is that we get the extra restart, right? Thank you. If the team is fine with this then I am as well. |
I think we didn't really discuss this broadly; I'll put it on the agenda for tomorrow so we can have a brief look. |
Yes. |
After some discussion with @sbernauer and @siegfriedweber we decided to push on with the webhook approach. |
So this means refinement is done? |
I'd say so, yes. There's still the question of how we want to prioritize this, but I guess that's your department... :P |
Teo did a handover with Siggi |
Current behaviour
New restarter-enabled StatefulSets have their replica 0 restarted after the initial rollout is complete.
Expected behaviour
The initial rollout should be completed "normally" with no extra restarts.
Why does this happen?
There is a race condition between Kubernetes' StatefulSet controller creating the first replica Pod and commons' StatefulSet restart controller adding the restart trigger labels. If the restarter loses the race then the first replica is created without the metadata, triggering a restart once it is added.
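To illustrate the mechanism (the annotation key below is an assumption, not taken from the actual controller), the race roughly plays out like this: pod 0 is created from a template without the restarter metadata, and the controller's later patch to spec.template is what the StatefulSet controller treats as an update.

```yaml
# Step 1 (illustrative): the StatefulSet controller starts pod 0 from the
# freshly created STS, whose pod template has no restarter metadata yet.
spec:
  template:
    metadata:
      annotations: {}
---
# Step 2 (illustrative): the restart controller patches the pod template.
# Because spec.template changed, the StatefulSet controller performs a
# rolling update, restarting the already-running pod 0 once.
spec:
  template:
    metadata:
      annotations:
        configmap.restarter.stackable.tech/superset-config: "rev-1"   # assumed key
```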
What can we do about it?
Add a mutating webhook (see the spike) that adds the relevant metadata. The webhook must not replace the existing controller, since webhook delivery is not reliable.
However, a webhook requires a bunch of extra infrastructure that we do not currently have, namely certificates for the webhook endpoint and a registration with the Kubernetes API (a MutatingWebhookConfiguration).
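A rough sketch of what that registration could look like; every name, namespace and path below is an assumption rather than part of the actual implementation, and failurePolicy: Ignore keeps the webhook fail-open as discussed above.

```yaml
# Illustrative MutatingWebhookConfiguration; all names, namespaces and
# paths are placeholders. The CA bundle for the webhook's serving
# certificate would go into clientConfig.caBundle.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: restarter.stackable.tech                 # hypothetical
webhooks:
  - name: statefulset-restarter.stackable.tech   # hypothetical
    admissionReviewVersions: ["v1"]
    sideEffects: None
    # Fail open: if the webhook is unreachable, the STS is still created
    # and we merely fall back to the current extra restart.
    failurePolicy: Ignore
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["statefulsets"]
    clientConfig:
      service:
        name: commons-operator-webhook     # hypothetical webhook service
        namespace: stackable               # hypothetical namespace
        path: /mutate-statefulset          # hypothetical path
```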
Definition of done
- The webhook adds the same podTemplate annotations as the controller currently does
- No extra restart is triggered by the initial rollout (STS.metadata.generation should stay 1)
- The webhook fails open (failurePolicy: Ignore)
- The existing restart controller keeps working when the webhook is unavailable (failurePolicy)
Original ticket
The StatefulSet of a Superset cluster is immediately restarted after its creation. This should not be necessary and should be prevented. After the restart, the Superset pods are annotated as follows:
This could be an indication that the restart controller of the commons-operator is involved.
The commons-operator is busy while the StatefulSet is restarted: