Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Ensure that insecureSkipTLSVerify is false before setting caBundle on APIService #160

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

JorTurFer
Copy link
Contributor

If the APIService has already registered the value insecureSkipTLSVerify: true, cert-controller fails patching the APIService because caBundle and insecureSkipTLSVerify: true aren't compatible.
This PR updates the behaviour to ensure that insecureSkipTLSVerify is false before setting the caBundle

@JorTurFer JorTurFer changed the title feat: Ensure that insecureSkipTLSVerify is false before setting caBundle on APIService fix: Ensure that insecureSkipTLSVerify is false before setting caBundle on APIService Nov 21, 2023
Comment on lines 440 to 442
if err := unstructured.SetNestedField(apiService.Object, false, "spec", "insecureSkipTLSVerify"); err != nil {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious to get some of the maintainers thoughts on whether this actually has to happen here, in cert-controller. If cert-controller is going to be used with APIService, presumably we are going to inject a caBundle so why not set this field outside of cert-controller to begin with ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem we have faced with is that when you set insecureSkipTLSVerify: true , you have to set explicitly insecureSkipTLSVerify: false to remove the value.
As insecureSkipTLSVerify: false is the default value, k8s api removes it automatically and CD tools like ArgoCD/Flux enter on permanent out-of-sync status.
kedacore/charts#550

In this scenario, we have some options:

  • Adding a step to remove the field manually (breaking the helm upgrade process)
  • Adding explicitly the value in helm chart, producing out-of-sync on CD tools
  • Updating insecureSkipTLSVerify before setting caBundle

Maybe I've missed something and there is any other option

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused here. You're saying that having insecureSkipTLSVerify: false in the Helm chart causes CI/CD pipelines to detect drift because the API server strips default values, correct?

I'm curious what makes cert-controller different in this regard, since it's merely another thing essentially running kubectl apply. Would cert-controller also be subject to the same default-stripping issue, leaving you right back where you began?

Or is the objective to remove the state from the Helm chart such that there is no longer any check for drift for this value?

Modifying controller code to meet the needs of CI/CD code does seem a bit backwards. I am also wary of potentially manually overriding user intent on what could be a very salient field (is there some reason why insecureSkipTLSVerify is set? Will un-setting it cause an outage? Can we be sure the answer will be identical for all projects that consume cert-controller?)

As for other options, maybe a post-install hook on the Helm chart to modify the value?

Here is an example of a post-install hook G8r uses:

https://github.com/open-policy-agent/gatekeeper/blob/master/charts/gatekeeper/templates/namespace-post-install.yaml

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused here. You're saying that having insecureSkipTLSVerify: false in the Helm chart causes CI/CD pipelines to detect drift because the API server strips default values, correct?

Yes, as it's the default value, it is removed automatically. If you deploy an apiservice with explicitly set insecureSkipTLSVerify: false and then you get it from the cluster, you will see that your set value insecureSkipTLSVerify: false has been chopped from the manifest. This triggers out-of-sync on CD systems like ArgoCD.
You can just try with this:

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  labels:
    app.kubernetes.io/name: v1beta1.external.metrics.k8s.io
    app.kubernetes.io/version: latest
    app.kubernetes.io/part-of: keda-operator
  name: v1beta1.custom.metrics.k8s.io
# nosemgrep: yaml.kubernetes.security.skip-tls-verify-service.skip-tls-verify-service
spec:
  service:
    name: keda-metrics-apiserver
    namespace: keda
  group: external.metrics.k8s.io
  version: v1beta1
  groupPriorityMinimum: 100
  versionPriority: 100
  insecureSkipTLSVerify: false

I'm curious what makes cert-controller different in this regard, since it's merely another thing essentially running kubectl apply. Would cert-controller also be subject to the same default-stripping issue, leaving you right back where you began?

Yes, it's removed exactly in the same way, but if it's cert-controller, we don't need to set it in the helm chart, so external tools aren't affected.

Or is the objective to remove the state from the Helm chart such that there is no longer any check for drift for this value?

The objective is to not have to explicitly set the value to default value in general. (not only from helm but from raw manifests)

Modifying controller code to meet the needs of CI/CD code does seem a bit backwards. I am also wary of potentially manually overriding user intent on what could be a very salient field (is there some reason why insecureSkipTLSVerify is set? Will un-setting it cause an outage? Can we be sure the answer will be identical for all projects that consume cert-controller?)

Actually, if cert-controller tries to patch the apiservice to add the caBundle but the apiservice has set insecureSkipTLSVerify: true, cert-controller will fail because insecureSkipTLSVerify: true and caBundle are mutually exclusive. If the value has been set to true and then cert-controller has been enabled, cert-controller won't work at all, printing errors all the time due to this mutual exclusivity.

As for other options, maybe a post-install hook on the Helm chart to modify the value?

We don't need this feature anymore (nor for helm nor for raw yamls) as we just needed it during some releases after the migration from random certs generated by the pods to real certs, stored in Kubernetes and managed by cert-controller/cert-manager. I mean, we don't need this feature anymore for KEDA xD

I've just pushed it instead of just closing the PR because this project is awesome and has helped us a lot and I assume that we aren't the only one who faces with this and the code was already done.

I can extend any other doubt that you have, said that, if you think that this doesn't make sense, we can close the PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, if cert-controller tries to patch the apiservice to add the caBundle but the apiservice has set insecureSkipTLSVerify: true, cert-controller will fail because insecureSkipTLSVerify: true and caBundle are mutually exclusive. If the value has been set to true and then cert-controller has been enabled, cert-controller won't work at all, printing errors all the time due to this mutual exclusivity.

This makes sense, but if a user has intentionally skipped TLS verification, will cert-controller overriding it break their use case? If so, consumers would be unable to use that version of cert-controller due to the mandatory override. It comes down to whether its better for cert-controller to return errors or for it to override user intent without context as to why it was set.

I've just pushed it instead of just closing the PR because this project is awesome and has helped us a lot and I assume that we aren't the only one who faces with this and the code was already done.

I very much appreciate the consideration and am glad the project has helped you!

I can extend any other doubt that you have, said that, if you think that this doesn't make sense, we can close the PR

I think if the override was optional, such that projects could opt-in, I would be less concerned. This would probably boil down to adding a new option to the rotator:

// CertRotator contains cert artifacts and a channel to close when the certs are ready.
type CertRotator struct {
reader SyncingReader
writer client.Writer
SecretKey types.NamespacedName
CertDir string
CAName string
CAOrganization string
DNSName string
ExtraDNSNames []string
IsReady chan struct{}
Webhooks []WebhookInfo
// FieldOwner is the optional fieldmanager of the webhook updated fields.
FieldOwner string
RestartOnSecretRefresh bool
ExtKeyUsages *[]x509.ExtKeyUsage
// RequireLeaderElection should be set to true if the CertRotator needs to
// be run in the leader election mode.
RequireLeaderElection bool
// CaCertDuration sets how long a CA cert will be valid for.
CaCertDuration time.Duration
// ServerCertDuration sets how long a server cert will be valid for.
ServerCertDuration time.Duration
// RotationCheckFrequency sets how often the rotation is executed
RotationCheckFrequency time.Duration
// LookaheadInterval sets how long before the certificate is renewed
LookaheadInterval time.Duration
// CertName and Keyname override certificate path
CertName string
KeyName string
certsMounted chan struct{}
certsNotMounted chan struct{}
wasCAInjected *atomic.Bool
caNotInjected chan struct{}
// testNoBackgroundRotation doesn't actually start the rotator in the background.
// This should only be used for testing.
testNoBackgroundRotation bool
}

Which defaults to off so that behavior would be backwards compatible for users on upgrade.

Definitely up to you if the extra work is worth your time. I apologize for the slow review cycle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kindly reminder @maxsmythe 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably want to test both cases, but I hear you about not wanting to propagate the expected assertion everywhere.

Maybe we can use functional options to make this a backwards-compatible change to testWebhook().

Functional options blog: https://commandcenter.blogspot.com/2014/01/self-referential-functions-and-design.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f**k! I missed your comment 🤦
I'll review the PR this week 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added another split test for this specific feature. PTAL when you have some time @maxsmythe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kindly reminder @maxsmythe 😄

pkg/rotator/rotator_test.go Show resolved Hide resolved
Signed-off-by: Jorge Turrado <[email protected]>
@codecov-commenter
Copy link

codecov-commenter commented Nov 30, 2023

Codecov Report

Attention: Patch coverage is 60.00000% with 4 lines in your changes are missing coverage. Please review.

Project coverage is 60.13%. Comparing base (190188d) to head (b6466d0).

Files Patch % Lines
pkg/rotator/rotator.go 60.00% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #160      +/-   ##
==========================================
+ Coverage   57.16%   60.13%   +2.97%     
==========================================
  Files           1        1              
  Lines         572      577       +5     
==========================================
+ Hits          327      347      +20     
+ Misses        181      170      -11     
+ Partials       64       60       -4     
Flag Coverage Δ
unittests 60.13% <60.00%> (+2.97%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@JorTurFer
Copy link
Contributor Author

Hello!
Any update about this?

@JorTurFer
Copy link
Contributor Author

Hello @ritazh @maxsmythe
Sorry for linking you directly, but this PR hasn't got any answer for a month. Could you review it please?

Copy link
Member

@ritazh ritazh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ritazh ritazh requested a review from maxsmythe January 23, 2024 04:24
@JorTurFer
Copy link
Contributor Author

Hello ✋
Can this be merged?

@JaydipGabani
Copy link
Contributor

@JorTurFer We are waiting on lgtm from @maxsmythe as well.

Signed-off-by: Jorge Turrado <[email protected]>
@@ -778,7 +788,7 @@ func (r *ReconcileWH) Reconcile(ctx context.Context, request reconcile.Request)
artifacts, err := buildArtifactsFromSecret(secret)
if err != nil {
crLog.Error(err, "secret is not well-formed, cannot update webhook configurations")
return reconcile.Result{}, nil
return reconcile.Result{}, err
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found this error too. If the reconciliation has failed, we should reschedule it manually or just return the error for auto rescheduling it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants