StatefulSet restarter always restarts replica 0 immediately after initial rollout #111

Open · 1 of 6 tasks
siegfriedweber (Member) opened this issue Sep 19, 2022 · 25 comments

Current behaviour

New restarter-enabled StatefulSets have their replica 0 restarted after the initial rollout is complete.

Expected behaviour

The initial rollout should be completed "normally" with no extra restarts.

Why does this happen?

There is a race condition between Kubernetes' StatefulSet controller creating the first replica Pod and commons' StatefulSet restart controller adding the restart trigger labels. If the restarter loses the race, the first replica is created without the metadata, and a restart is triggered once the metadata is added.
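
For illustration, this is roughly what the restarter writes; the annotation keys are taken from the pod annotations shown in the original ticket below, while the name and the UID/resourceVersion values are placeholders. Any change to spec.template counts as a new revision, so a pod created before the patch gets recreated:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: simple-superset-node-default
  labels:
    restarter.stackable.tech/enabled: "true"    # set by the product operator on creation
spec:
  template:
    metadata:
      annotations:
        # patched in later by the commons restart controller; values are <uid>/<resourceVersion>
        configmap.restarter.stackable.tech/simple-superset-node-default: <uid>/<resourceVersion>
        secret.restarter.stackable.tech/simple-superset-credentials: <uid>/<resourceVersion>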

What can we do about it?

Add a mutating webhook (see the spike) that adds the relevant metadata. The webhook must not replace the existing controller, since webhook delivery is not reliable.

However, serving a webhook requires a bunch of extra infrastructure that we do not currently have, namely:

  1. We must implement the K8s admission webhook HTTPS API (https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/)
  2. We must generate a TLS certificate and write it into the MutatingWebhookConfiguration (MWC); a rough sketch of such an MWC follows this list
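
A rough sketch of what such an MWC could look like (the names, service reference, and path are placeholders, not a final design):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: restarter.commons.stackable.tech          # placeholder name
webhooks:
  - name: statefulset.restarter.commons.stackable.tech   # placeholder
    failurePolicy: Ignore         # fail open: a broken webhook must not block StatefulSet writes
    sideEffects: None
    admissionReviewVersions: ["v1"]
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["statefulsets"]
    clientConfig:
      caBundle: <base64-encoded CA certificate>   # provisioned and renewed by the commons operator
      service:
        name: commons-operator    # placeholder service name
        namespace: stackable      # placeholder namespace
        path: /mutate             # placeholder path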

Definition of done

  • Webhook certificate is managed (provisioned and renewed) by the commons operator
  • The webhook should apply the same podTemplate annotations as the controller currently does
    • Initial STS rollout does not cause a restart (STS.metadata.generation should stay 1)
  • The controller must still apply metadata if the webhook is disabled and/or fails (at the cost of still doing the extra restart in this case)
  • The webhook must fail open (failurePolicy: Ignore)
  • A Kuttl test should verify the above (maybe minus the failurePolicy); a sketch of a possible assert follows this list
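
A minimal sketch of such a kuttl assert (file name and StatefulSet name are hypothetical); kuttl waits until the live object is a superset of these fields, so the test fails if the generation has moved past 1 by the time the replica is ready:

# 01-assert.yaml (hypothetical file in the kuttl test case directory)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: restart-test          # placeholder name of the StatefulSet under test
  generation: 1               # must still be 1, i.e. no post-rollout template change
status:
  readyReplicas: 1
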
Original ticket

The StatefulSet of a Superset cluster is immediately restarted after its creation. This should not be necessary and should be prevented.

$ kubectl describe statefulset simple-superset-node-default
...
Events:
  Type    Reason            Age                    From                    Message
  ----    ------            ----                   ----                    -------
  Normal  SuccessfulDelete  3m31s                  statefulset-controller  delete Pod simple-superset-node-default-0 in StatefulSet simple-superset-node-default successful
  Normal  SuccessfulCreate  2m59s (x2 over 3m31s)  statefulset-controller  create Pod simple-superset-node-default-0 in StatefulSet simple-superset-node-default successful

After the restart, the Superset pods are annotated as follows:

annotations:
  configmap.restarter.stackable.tech/simple-superset-node-default: cf60300e-0c45-4ee2-b60c-de53b0084182/21998
  secret.restarter.stackable.tech/simple-superset-credentials: e0e4b781-46f9-44f4-80c8-a0876b91ed8b/16909

This could be an indication that the restart controller of the commons-operator is involved.

The commons-operator is busy while the StatefulSet is restarted:

2022-09-16T10:02:50.228971Z  INFO stackable_operator::logging::controller: Reconciled object controller.name="pod.restarter.commons.stackable.tech" object=Pod.v1./simple-superset-node-default-0.default
2022-09-16T10:02:50.236144Z  INFO stackable_operator::logging::controller: Reconciled object controller.name="pod.restarter.commons.stackable.tech" object=Pod.v1./simple-superset-node-default-0.default
2022-09-16T10:02:50.239371Z  INFO stackable_operator::logging::controller: Reconciled object controller.name="statefulset.restarter.commons.stackable.tech" object=StatefulSet.v1.apps/simple-superset-node-default.default
2022-09-16T10:02:50.255219Z  INFO stackable_operator::logging::controller: Reconciled object controller.name="statefulset.restarter.commons.stackable.tech" object=StatefulSet.v1.apps/simple-superset-node-default.default
2022-09-16T10:02:50.258511Z  INFO stackable_operator::logging::controller: Reconciled object controller.name="pod.restarter.commons.stackable.tech" object=Pod.v1./simple-superset-node-default-0.default
2022-09-16T10:02:50.266146Z  INFO stackable_operator::logging::controller: Reconciled object controller.name="pod.restarter.commons.stackable.tech" object=Pod.v1./simple-superset-node-default-0.default
2022-09-16T10:02:50.274647Z  INFO stackable_operator::logging::controller: Reconciled object controller.name="statefulset.restarter.commons.stackable.tech" object=StatefulSet.v1.apps/simple-superset-node-default.default
2022-09-16T10:02:51.433621Z  INFO stackable_operator::logging::controller: Reconciled object controller.name="pod.restarter.commons.stackable.tech" object=Pod.v1./simple-superset-node-default-0.default
2022-09-16T10:02:51.449009Z  INFO stackable_operator::logging::controller: Reconciled object controller.name="statefulset.restarter.commons.stackable.tech" object=StatefulSet.v1.apps/simple-superset-node-default.default
@siegfriedweber siegfriedweber self-assigned this Oct 20, 2022
@siegfriedweber (Member, Author)

Actually this is how the restart controller is supposed to work. The Superset operator creates a stateful set labeled with restarter.stackable.tech/enabled=true. The restart controller of the commons operator watches this labeled stateful set and adds the UIDs and resource versions of the referenced config maps and secrets as annotations to the pod template. This triggers a restart of the stateful set right after its first creation. The commons operator watches these config maps and secrets and updates the annotations whenever these resources change which in turn triggers a restart. This ensures that Superset always runs with the latest configuration.

Even if the immediate restart is confusing, I do not see an elegant solution to prevent this, so I propose to close this issue.

@fhennig (Contributor) commented Oct 24, 2022

If the outcome is "this is how it is supposed to work", then it might not be a bug, but we should still go back to the drawing board and look at what we've built, because this immediate restart is really something that shouldn't happen.

Maybe the title of the issue and also its place in this repo are not correct anymore.

After talking to sigi, I'm moving it back to "Refinement: In Progress"

@vsupalov (Contributor) commented Oct 25, 2022

Taking a step back. Outcomes from the refinement discussion:

Observed Situation

  • A clean system is easier to monitor, debug, and operate
  • Random restarts are surprising in a bad way
    • Logs do not mention the reason for this behaviour - just the restart is visible
  • We want our systems to be understandable, and observable (know why restart happens)
  • The restart controller causes a restart, which might be avoided
  • Two operators working on the same resource sometimes interact in a hard-to-understand fashion (badly designed system?)

Investigation Worthy Threads

  • Can we avoid this kind of initial restart? Are they strictly necessary?
    • Mutating admission controller / webhook (?) link
    • Port restarting logic into the operator-rs framework (it is a library!)? Reusable functions.
  • Can we add logs to understand what's going on?
    • Add logging outputs to justify / announce restarts
  • Root cause analysis: how was the decision made that we want to have this kind of pattern?
    • Possible "design guideline" which could have been helpful: "only one operator is modifying a single resource"

Possible Next Steps

  • Create a spike investigating whether the restart is necessary?
  • Address "how was the restarting decision made" in an architecture meeting?
  • Do we need somebody who feels responsible for higher-level architecture patterns?

@vsupalov (Contributor)

@soenkeliebau @lfrancke unclear how to proceed here, some input would be great. Do we leave things as they are here, or do we pursue one of the investigation threads?

This ticket can be closed in favour of these other investigation threads.

@nightkr (Member) commented Oct 25, 2022

That the commons-operator is reconciling doesn't really say much; that will happen whenever the STS is modified externally, too. If you want to check whether it has triggered a restart, check the STS YAML for whether the restarter annotations change.
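
For example, one way to inspect the template generation and the restarter annotations on the STS (name taken from the original report):

$ kubectl get statefulset simple-superset-node-default \
    -o jsonpath='{.metadata.generation}{"\n"}{.spec.template.metadata.annotations}{"\n"}'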

I'd also argue that the restart controller isn't the cause of the problem here, just a symptom. The restart controller will trigger when the app's config changes. Check for why that happens if you want to avoid this.

@nightkr (Member) commented Oct 25, 2022

(Case in point, if the restart controller is at fault then this will happen for all product operators, not just Superset...)

@lfrancke (Member)

Thanks Teo.

The restart controller of the commons operator watches this labeled stateful set and adds the UIDs and resource versions of the referenced config maps and secrets as annotations to the pod template. This triggers a restart of the stateful set right after its first creation.

I'm a bit confused now, though. This sounds like the answer that you were looking for, no?
I must be missing something still. I thought the restart happens because the restart controller changes the configmaps which in turn triggers a restart of the STS?

@nightkr (Member) commented Oct 25, 2022

The restart controller just watches the configmaps for changes, and triggers STS restarts based on that. The restart controller is just a messenger for the actual issue (that the CM changed). It never modifies the CMs itself.

@lfrancke (Member)

Understood, thank you.
So, the described behavior has to come from somewhere else then.

@siegfriedweber @fhennig @vsupalov - sorry, back to you for now?

@siegfriedweber (Member, Author)

If you want to check whether it has triggered a restart then you'll want to check the STS YAMLs for whether the restarter annotations change.

The restarter annotations are initially added to the STS YAMLs by the restart controller. This triggers the restart.

The restart controller will trigger when the app's config changes.

The restart controller also triggers when a new StatefulSet with the annotation restarter.stackable.tech/enabled=true is created.

(Case in point, if the restart controller is at fault then this will happen for all product operators, not just Superset...)

It happens for all StatefulSets which are annotated with restarter.stackable.tech/enabled=true. That is, it also happens for the Airflow operator. The other operators do not use this annotation yet.

I think that I accurately described the issue in my first comment:

Actually this is how the restart controller is supposed to work. The Superset operator creates a stateful set labeled with restarter.stackable.tech/enabled=true. The restart controller of the commons operator watches this labeled stateful set and adds the UIDs and resource versions of the referenced config maps and secrets as annotations to the pod template. This triggers a restart of the stateful set right after its first creation. The commons operator watches these config maps and secrets and updates the annotations whenever these resources change which in turn triggers a restart. This ensures that Superset always runs with the latest configuration.

Please tell me if I am wrong.

@soenkeliebau (Member)

@siegfriedweber and @teozkr will have a chat about this

@nightkr (Member) commented Oct 27, 2022

Hm, the label "should" still be added quickly enough that the pods shouldn't have been scheduled yet, but maybe my memory fails me on how much of a window we have; I will have to do some testing tomorrow... If that is indeed the sad case, then a fail-open mutating webhook (since we'd still need the controller as well, the worst that could happen would be falling back to the current extra restart) probably makes sense at some point, and it should be completely transparent to the operators.

The downside is that doing a first webhook means doing a bunch of new scaffolding from scratch (certificates, kube API registration, etc)...

@nightkr (Member) commented Oct 27, 2022

Ok, having tried it out now it does look like the first replica is restarted while all other replicas are created late enough that they are up to date. Avoiding that glitch will probably require us to do the webhook...

@nightkr (Member) commented Oct 28, 2022

I have validated that a mutating webhook is able to avoid the initial restart for a simple STS... 0cbf72c

Moving this ticket to commons since this is a bug in the restarter, not in superset-op specifically.

@nightkr nightkr transferred this issue from stackabletech/superset-operator Oct 28, 2022
@nightkr nightkr changed the title Prevent unnecessary restart of the StatefulSet StatefulSet restarter always restarts replica 0 immediately after initial rollout Oct 28, 2022
@fhennig (Contributor) commented Oct 31, 2022

So, is the mutating webhook the accepted solution to this problem? I see that this ticket is in refinement acceptance; I'd like to know the up- and downsides of this solution. Is there another way? How bad is:

The downside is that doing a first webhook means doing a bunch of new scaffolding from scratch (certificates, kube API registration, etc)...

I personally wouldn't call this ticket refined, as I wouldn't know what to do. The ticket doesn't have tasks or even acceptance criteria.

@lfrancke (Member) commented Nov 1, 2022

I agree, this came up during the acceptance. So I'll move it back for now.

@lfrancke lfrancke moved this from Refinement Acceptance: In Progress to Refinement: Waiting for in Stackable Engineering Nov 1, 2022
@lfrancke lfrancke moved this from Refinement: Waiting for to Next in Stackable Engineering Nov 7, 2022
@nightkr nightkr moved this from Next to Refinement: In Progress in Stackable Engineering Nov 8, 2022
@nightkr nightkr self-assigned this Nov 8, 2022
@nightkr nightkr moved this from Refinement: In Progress to Refinement Acceptance: Waiting for in Stackable Engineering Nov 8, 2022
@lfrancke (Member) commented Nov 8, 2022

A few questions:

  1. Do I understand it correctly that this will "only" fix the issue in most cases since delivery of webhooks is not reliable?
  2. Do webhooks require any kind of extra privileges that we currently don't need?
  3. Do webhooks work on OpenShift?
  4. Does this need to be presented to the team again? I assume no because I believe this is the outcome of the last team session, right?

@lfrancke lfrancke moved this from Refinement Acceptance: Waiting for to Refinement Acceptance: In Progress in Stackable Engineering Nov 8, 2022
@lfrancke lfrancke self-assigned this Nov 8, 2022
@nightkr (Member) commented Nov 8, 2022

Do I understand it correctly that this will "only" fix the issue in most cases since delivery of webhooks is not reliable?

Yes, good old CAP strikes again. There's no way that we could possibly hook into an API in a way that is all of the following:

  1. Synchronous (happens before the write completes)
  2. Exhaustive (never misses any events)
  3. Reliable (doesn't bring down (parts of) the hooked API when it fails)

We could technically sacrifice any of these, but (to me) 2 and 3 are both far more vital than 1.

Do webhooks require any kind of extra privileges that we currently don't need?

The operator and installer both need permission to work with MutatingWebhookConfiguration objects. Additionally, webhooks do introduce the risk of bugs and/or misconfiguration wreaking havoc on the larger system.

This design would allow users who are uncomfortable with running our webhook to simply not deploy the MWC, at the cost of still having to deal with this extra restart.
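
For reference, the extra RBAC would be along these lines (role name and verb list are a sketch, not the final set):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: commons-operator-webhook      # placeholder name
rules:
  - apiGroups: ["admissionregistration.k8s.io"]
    resources: ["mutatingwebhookconfigurations"]
    verbs: ["get", "list", "watch", "create", "patch"]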

Do webhooks work on OpenShift?

I haven't tested the spike against OpenShift specifically, but Google seems to believe yes. I also remember noticing that certain parts of OpenShift itself are implemented using webhooks.

Does this need to be presented to the team again? I assume no because I believe this is the outcome of the last team session, right?

It might be worth bringing up at tomorrow's arch meeting. I'll add it to the list.

@lfrancke (Member) commented Nov 8, 2022

Yes, good old CAP strikes again. There's no way that we could possibly hook into an API in a way that is all of the following:

That's fine for me, as the worst that can happen is that we get the extra restart, right?
Similar to the answer to question two.

Thank you. If the team is fine with this then I am as well.

@fhennig (Contributor) commented Nov 8, 2022

I think we didn't really discuss this broadly. I'll put it on the agenda for tomorrow so we can have a brief look.

@nightkr (Member) commented Nov 9, 2022

That's fine for me, as the worst that can happen is that we get the extra restart, right?

Yes.

@nightkr (Member) commented Nov 10, 2022

After some discussion with @sbernauer and @siegfriedweber we decided to push on with the webhook approach.

@lfrancke (Member)

So this means refinement is done?

@nightkr (Member) commented Nov 10, 2022

I'd say so, yes. There's still the question of how we want to prioritize this, but I guess that's your department... :P

@lfrancke lfrancke moved this from Refinement Acceptance: In Progress to Ready for Development in Stackable Engineering Nov 11, 2022
@nightkr nightkr self-assigned this Nov 14, 2022
@nightkr nightkr moved this from Ready for Development to Development: In Progress in Stackable Engineering Nov 14, 2022
@sbernauer (Member)

Teo did a handover with Siggi
