Downtime after a caBundle until Secret propagation to pod #50

ahmetb · 2022-11-14T23:04:56Z

Based on my experimentation, it seems that the kubelet's latency to reflect the updates on a watched Secret (configMapAndSecretChangeDetectionStrategy=Watch) to a container's filesystem seems to be ranging from 30-100 seconds (i.e. not instant), regardless of minikube, kind, GKE or kubeadm clusters.

Does this basically mean that until the container that's running the webhook (and automating certificate management with cert-controller package), the webhook actually will be down because this library updates WebhookConfiguration's .caBundle field with the new CA cert (which instantly takes effect) and it will no longer match the served TLS certificate for another minute or so?

Is this a known issue, or something that's factored to the current design that's solved (maybe I'm seeing it incorrectly).

The text was updated successfully, but these errors were encountered:

adrianludwin · 2022-11-14T23:12:29Z

That's right, which is exactly why RestartOnSecretRefresh exists - it's probably faster to just kill the pod and let it restart

adrianludwin · 2022-11-14T23:13:21Z

There's probably a better way but I'm not sure what it is and this worked well enough :)

ahmetb · 2022-11-14T23:39:21Z

What's the invariant that'd make restarting the pod pick up the new secret?

If anything, the same container will be just restarted by the container runtime on the same host (restartpolicy) and it will still see the same volume, no? And with that, it will miss the change event and keep using the same secret.

I think it would be a lot better to do something like cross signing, or expanding the CA list by keeping both old/new (prev/next) CA certs around.

adrianludwin · 2022-11-15T00:36:00Z

I'm not sure. In my experience restarting the pod has always worked instantly. I feel like someone told me that once but it was some time ago...

Yup, serving up multiple certs would definitely be a cleaner way of doing this. Given that we set a default 10yr expiry period (IIRC), we were most concerned with the startup performance, where the original secret is effectively empty so there's zero chance of it working during the initial startup. Again, it wouldn't surprise me if cert-manager solved this in some much better way.

ahmetb · 2022-11-15T01:53:41Z

In our setup we'd much rather not use cert-manager (it comes with multiple components/CRDs).
Similarly RestartOnSecretRefresh also doesn't work because it's too violent (drops requests, and we run some HA webhooks where simultaneously restarting all of them is a recipe for disaster).

I think developing a patch around keeping both CAs in the caBundle may solve the current downtime problem while the Secret propagates and is picked up.

adrianludwin · 2022-11-15T04:07:34Z

sgtm in all cases except for the initial startup. @maxsmythe , @ritazh , wdyt?

maxsmythe · 2022-11-15T09:36:41Z

+1 to having multiple bundles. Might be worth figuring out a way to gradually roll out the cert across processes too.

ritazh · 2022-11-15T14:55:12Z

+1 on supporting multiple bundles.

yizha1 · 2023-05-18T14:03:26Z

Hi Folks, there is a similar issue ratify-project/ratify#821. The mTLS is required between Gatekeeper and external data provider. By default, cert-controller is used to generate and rotate Gatekeeper's webhook certificate. In our case, the user manually rotated the certificate. It seems Kubernetes took about 60-90 seconds to propagate changes to Secrets. During this period of delay, the request being sent to external data provider will fail.

maxsmythe · 2023-05-20T00:32:28Z

@acpana This could be interesting work.

acpana · 2023-05-25T00:16:18Z

thanks for the tag max! I can have a look at this in my downtime from other projects. I will assign it to myself when I get to it. In the meantime, folks can feel free to jump on it if they have cycles.

dvob · 2023-05-27T07:52:53Z

I also raised this problem a while ago. See #13 .

dlipovetsky · 2024-10-24T18:24:07Z

I created a diagram to help me better understand the problem.

sequenceDiagram
    autonumber

    participant client

    par Update certs
    cert-controller->>apiserver: Create/update Secret
    cert-controller->>apiserver: Update webhook configuration
    Note over kubelet: Delay before updating volume,<br>usually 30 to 100 seconds.
    kubelet->>apiserver: Read updated Secret
    kubelet->>volume: Write certs to /tmp/k8s-webhook-server/serving-certs
    webhook-server->>volume: Read certs from /tmp/k8s-webhook-server/serving-certs
    and Call webhook
    client->>apiserver: Create/update resource
    apiserver->>webhook-server: Submit create/update request
    Note over apiserver: Clients certs are from the webhook configuration
    webhook-server->>apiserver: TLS Error
    Note over webhook-server: Server certs are from the volume<br>They do not (yet) match the client certs.
    end

ahmetb changed the title ~~Downtime during updates and secret propagation~~ Downtime after a caBundle until Secret propagation to pod Nov 14, 2022

akashsinghal mentioned this issue May 25, 2023

External Data Provider mTLS Certificate Management Strategy open-policy-agent/gatekeeper#2793

Closed

fabriziosestito mentioned this issue Jul 22, 2024

Handle certificate rotation kubewarden/kubewarden-controller#7

Closed

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Downtime after a caBundle until Secret propagation to pod #50

Downtime after a caBundle until Secret propagation to pod #50

ahmetb commented Nov 14, 2022 •

edited

Loading

adrianludwin commented Nov 14, 2022

adrianludwin commented Nov 14, 2022

ahmetb commented Nov 14, 2022 •

edited

Loading

adrianludwin commented Nov 15, 2022

ahmetb commented Nov 15, 2022

adrianludwin commented Nov 15, 2022

maxsmythe commented Nov 15, 2022

ritazh commented Nov 15, 2022

yizha1 commented May 18, 2023

maxsmythe commented May 20, 2023

acpana commented May 25, 2023

dvob commented May 27, 2023

dlipovetsky commented Oct 24, 2024

Downtime after a caBundle until Secret propagation to pod #50

Downtime after a caBundle until Secret propagation to pod #50

Comments

ahmetb commented Nov 14, 2022 • edited Loading

adrianludwin commented Nov 14, 2022

adrianludwin commented Nov 14, 2022

ahmetb commented Nov 14, 2022 • edited Loading

adrianludwin commented Nov 15, 2022

ahmetb commented Nov 15, 2022

adrianludwin commented Nov 15, 2022

maxsmythe commented Nov 15, 2022

ritazh commented Nov 15, 2022

yizha1 commented May 18, 2023

maxsmythe commented May 20, 2023

acpana commented May 25, 2023

dvob commented May 27, 2023

dlipovetsky commented Oct 24, 2024

ahmetb commented Nov 14, 2022 •

edited

Loading

ahmetb commented Nov 14, 2022 •

edited

Loading