
Downtime after a caBundle until Secret propagation to pod #50

Open · ahmetb opened this issue Nov 14, 2022 · 13 comments

Comments

@ahmetb commented Nov 14, 2022

Based on my experimentation, the kubelet's latency to reflect updates to a watched Secret (configMapAndSecretChangeDetectionStrategy=Watch) in a container's filesystem ranges from roughly 30 to 100 seconds (i.e. it is not instant), regardless of whether the cluster is minikube, kind, GKE, or kubeadm.

Does this basically mean that until the container running the webhook (and managing its certificates with the cert-controller package) picks up the new Secret, the webhook will effectively be down? This library updates the WebhookConfiguration's .caBundle field with the new CA cert (which takes effect instantly), so the caBundle no longer matches the served TLS certificate for another minute or so.

Is this a known issue, or is it something the current design already accounts for and solves (maybe I'm reading it incorrectly)?
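
For reference, a minimal sketch of how that latency could be observed from inside the pod: poll the projected Secret volume and log when the serving cert on disk actually changes. This is an illustration, not part of the library; the path matches the serving-cert directory mentioned later in this thread, and the polling interval is an assumption.

```go
package main

import (
	"crypto/sha256"
	"log"
	"os"
	"time"
)

func main() {
	// Assumed mount path (controller-runtime's default serving-cert directory).
	const certPath = "/tmp/k8s-webhook-server/serving-certs/tls.crt"

	var last [32]byte
	for {
		data, err := os.ReadFile(certPath)
		if err != nil {
			log.Printf("read %s: %v", certPath, err)
		} else if sum := sha256.Sum256(data); sum != last {
			// Logs a timestamp each time the projected file content changes,
			// which is when the kubelet has finally propagated the Secret.
			log.Printf("serving cert changed at %s", time.Now().Format(time.RFC3339))
			last = sum
		}
		time.Sleep(time.Second)
	}
}
```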

ahmetb changed the title from "Downtime during updates and secret propagation" to "Downtime after a caBundle until Secret propagation to pod" on Nov 14, 2022
@adrianludwin (Contributor)

That's right, which is exactly why RestartOnSecretRefresh exists - it's probably faster to just kill the pod and let it restart
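
For anyone landing here, a rough sketch of enabling that option when wiring up the rotator. This is adapted from the setup shown in the README; the namespace, names, and DNS name are placeholders, and exact field names may differ across versions.

```go
package setup

import (
	"github.com/open-policy-agent/cert-controller/pkg/rotator"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
)

// setupRotator registers the cert rotator with a controller-runtime manager.
func setupRotator(mgr ctrl.Manager) error {
	return rotator.AddRotator(mgr, &rotator.CertRotator{
		SecretKey: types.NamespacedName{
			Namespace: "my-system",     // placeholder
			Name:      "webhook-certs", // placeholder
		},
		CertDir:        "/tmp/k8s-webhook-server/serving-certs",
		CAName:         "my-ca",
		CAOrganization: "my-org",
		DNSName:        "my-webhook-service.my-system.svc",
		Webhooks: []rotator.WebhookInfo{
			{Name: "my-validating-webhook", Type: rotator.Validating},
		},
		// Kill the pod when the Secret is refreshed instead of waiting for the
		// kubelet to project the new certs into the mounted volume.
		RestartOnSecretRefresh: true,
	})
}
```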

@adrianludwin (Contributor)

There's probably a better way but I'm not sure what it is and this worked well enough :)

@ahmetb (Author) commented Nov 14, 2022

What's the invariant that'd make restarting the pod pick up the new secret?

If anything, the same container will just be restarted by the container runtime on the same host (per restartPolicy), and it will still see the same volume, no? In that case it will miss the change event and keep using the same secret.

I think it would be a lot better to do something like cross signing, or expanding the CA list by keeping both old/new (prev/next) CA certs around.
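
A minimal sketch of what that could look like: build the caBundle as the concatenation of the previous and next CA certificates, so the API server trusts serving certs signed by either CA while the kubelet catches up. The function and names here are illustrative, not something the library provides today.

```go
package cabundle

import (
	"bytes"
	"crypto/x509"
	"encoding/pem"
)

// buildCABundle concatenates the PEM-encoded previous and next CA
// certificates, skipping any input that is empty or not a valid certificate.
// The result is what would be written (base64-encoded by the Kubernetes
// client) into the webhook configuration's clientConfig.caBundle field.
func buildCABundle(prevCAPEM, nextCAPEM []byte) []byte {
	var bundle bytes.Buffer
	for _, caPEM := range [][]byte{prevCAPEM, nextCAPEM} {
		block, _ := pem.Decode(caPEM)
		if block == nil {
			continue
		}
		if _, err := x509.ParseCertificate(block.Bytes); err != nil {
			continue
		}
		bundle.Write(pem.EncodeToMemory(block))
	}
	return bundle.Bytes()
}
```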

@adrianludwin (Contributor)

I'm not sure. In my experience restarting the pod has always worked instantly. I feel like someone told me that once but it was some time ago...

Yup, serving up multiple certs would definitely be a cleaner way of doing this. Given that we set a default 10yr expiry period (IIRC), we were most concerned with the startup performance, where the original secret is effectively empty so there's zero chance of it working during the initial startup. Again, it wouldn't surprise me if cert-manager solved this in some much better way.

@ahmetb (Author) commented Nov 15, 2022

In our setup we'd much rather not use cert-manager (it comes with multiple components/CRDs).
Similarly, RestartOnSecretRefresh doesn't work for us because it's too violent (it drops requests, and we run some HA webhooks where restarting all of them simultaneously is a recipe for disaster).

I think developing a patch around keeping both CAs in the caBundle may solve the current downtime problem while the Secret propagates and is picked up.

@adrianludwin (Contributor)

sgtm in all cases except for the initial startup. @maxsmythe , @ritazh , wdyt?

@maxsmythe (Contributor)

+1 to having multiple bundles. Might be worth figuring out a way to gradually roll out the cert across processes too.

@ritazh (Member) commented Nov 15, 2022

+1 on supporting multiple bundles.

@yizha1 commented May 18, 2023

Hi folks, there is a similar issue: ratify-project/ratify#821. mTLS is required between Gatekeeper and the external data provider, and by default cert-controller is used to generate and rotate Gatekeeper's webhook certificate. In our case, the user rotated the certificate manually, and Kubernetes took about 60-90 seconds to propagate the change to the Secret. During that delay, requests sent to the external data provider fail.

@maxsmythe (Contributor)

@acpana This could be interesting work.

@acpana (Contributor) commented May 25, 2023

thanks for the tag max! I can have a look at this in my downtime from other projects. I will assign it to myself when I get to it. In the meantime, folks can feel free to jump on it if they have cycles.

@dvob commented May 27, 2023

I also raised this problem a while ago. See #13.

@dlipovetsky

I created a diagram to help me better understand the problem.

```mermaid
sequenceDiagram
    autonumber

    participant client

    par Update certs
    cert-controller->>apiserver: Create/update Secret
    cert-controller->>apiserver: Update webhook configuration
    Note over kubelet: Delay before updating volume,<br>usually 30 to 100 seconds.
    kubelet->>apiserver: Read updated Secret
    kubelet->>volume: Write certs to /tmp/k8s-webhook-server/serving-certs
    webhook-server->>volume: Read certs from /tmp/k8s-webhook-server/serving-certs
    and Call webhook
    client->>apiserver: Create/update resource
    apiserver->>webhook-server: Submit create/update request
    Note over apiserver: Client certs are from the webhook configuration.
    webhook-server->>apiserver: TLS error
    Note over webhook-server: Server certs are from the volume.<br>They do not (yet) match the client certs.
    end
```
