Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: tlspolicy enforced condition when certificate/issuer ready condition is missing #715

Merged
merged 10 commits into from
Jul 10, 2024

Conversation

KevFan
Copy link
Contributor

@KevFan KevFan commented Jun 26, 2024

Description

Closes: #704

  • fix: tlspolicy enforced condition when certificate/issuer ready condition is missing
    • Previous using IsStatusConditionFalse will return true if the Ready status condition is missing which caused TLSPolicy to incorrectly set the Enforced condition to true
  • feat: TLSPolicy watch for owned Certificates
  • feat: watch for issuer/clusterIssuer status changes for TLSPolicy
    • TLSPolicy should watch for status changes to ClusterIssuer/Issuer to react to whether the policy is correctly enforced or not. Also previously TLSPolicy integration tests are a bit flaky due to an certain order of events must happen / resources must be ready in order to pass the test
  • regression: do not wait for gateways to be ready for tlspolicy
    • There is a regression issue where TLSPolicy should not wait for the gateways to be ready as the controller may be creating the neceesary resources in order for the gateway to be in the ready state
  • fix: get all gateways events for tlspolicy
    • Previously the gateway event for TLSPolicy is filtered out if the gateway is not in a programmed state in the DAG, but is a valid use case for this policy. This caused the test TLSPolicy controller with multiple https listener [It] should delete tls certificate when listener is removed to be flaky. Added this filter option to the DAG to work as expected for TLSPolicy.

Verification

Passing tests should be sufficient but if you want to manually verrify:

  • Deploy this branch
make local-setup
  • Apply TLSPolicy
kubectl apply -n istio-system -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: istio-ingressgateway
spec:
  gatewayClassName: istio
  listeners:
  - name: http
    hostname: '*.toys.io'
    port: 443
    protocol: HTTPS
    tls:
      mode: Terminate
      certificateRefs:
      - name: toys
        kind: Secret
    allowedRoutes:
      namespaces:
        from: All    
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}
---
apiVersion: kuadrant.io/v1alpha1
kind: TLSPolicy
metadata:
  name: gw-tls
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: istio-ingressgateway
  issuerRef:
    group: cert-manager.io
    kind: Issuer
    name: selfsigned-issuer
EOF
  • Watch TLSPolicy status
watch -n 0.1 "kubectl get tlspolicy -A -o yaml | yq '.items[0]'"
  • Verified enforced status is:
    - lastTransitionTime: "2024-07-01T08:59:46Z"
      message: TLSPolicy has been successfully enforced
      reason: Enforced
      status: "True"
      type: Enforced
  • Delete certificate secret
kubectl delete secrets toys -n istio-system
  • Verify TLSPolicy enforced condition quickly tranistions to false and back to true

@KevFan KevFan self-assigned this Jun 26, 2024
@KevFan KevFan added kind/bug Something isn't working kind/enhancement New feature or request size/medium labels Jun 28, 2024
@KevFan KevFan force-pushed the ssues/704 branch 2 times, most recently from 910eb98 to 21e0a1c Compare June 28, 2024 16:47
@KevFan KevFan marked this pull request as ready for review July 1, 2024 09:34
@KevFan KevFan requested a review from a team as a code owner July 1, 2024 09:34
@KevFan KevFan requested a review from mikenairn July 1, 2024 09:34
@mikenairn
Copy link
Member

mikenairn commented Jul 1, 2024

Why is it necessary to filter out Gateways based on the "programmed" status here or here? I'm not familiar with DAG, or how it is being used, but would it not be easier for all events to make it to each policy controller and they decide what to do?

@KevFan
Copy link
Contributor Author

KevFan commented Jul 1, 2024

Why is it necessary to filter out Gateways based on the "programmed" status here or here? I'm not familiar with DAG, or how it is being used, but would it not be easier for all events to make it each policy controller and they decide what to do?

The intention for filtering here, at least for RateLimit / Auth, is that it only cares about programmed gateways, since if it's in an unready state, it should not do any of the configuration since no traffic will be recieved anyway. The policy will be marked as Accepted: False from this error.

The DAG (a view of the cluster of things we care about, that was first used at the reconciler level) does the same filtering because of the same assumption that it should only care about programmed Gateways. But this filtering now also affects events since it was introduced at the gateway event mapper level from this change. I believe it's also with the same intention, no point to reconciling on policies on a gateway event where the Gateway is unready at least for RateLimit / Auth. But thinking on this, an event where the gateway goes from a Programmed: true -> Programmed: false state will be ignored, and all the policies under this gateway will have the incorrect statuses, so we probably should not ignore them at the event level 🤔

In summary I think it's necessary at least here to mark policies as Accepted: false for at least Auth & RateLimit (dont know if this should be the case for TLS and DNS) and probably shouldn't do this filtering at the event level.

Not sure does anyone else want to give an opinion on this though 🤔

@mikenairn
Copy link
Member

But thinking on this, an event where the gateway goes from a Programmed: true -> Programmed: false state will be ignored, and all the policies under this gateway will have the incorrect statuses, so we probably should not ignore them at the event level 🤔

Exactly, i don't think any policy should be ignoring events, they should all go down to the policy controller and it decides what to do.

Any assumption made about only caring about gateways with a status "Programmed: true" is wrong anyway as we have at least one policy of our own that does care about Gateways not programmed (TLSPolicy) , and other future polices may do also. So could we just remove these checks?

@eguzki @guicassolato?

@maleck13
Copy link
Collaborator

maleck13 commented Jul 1, 2024

Tend to agree with @mikenairn each policy controller should decide what it wants to do with a given gateway state. TLS is a good example. A gateway wont get into a programmed state unless TLSPolicy has setup the secret for it

@KevFan
Copy link
Contributor Author

KevFan commented Jul 3, 2024

@mikenairn @maleck13 Not sure about completely removing the checks since both checks are primarly used within the reconcilers anyway and there's more instances of wanting to filter them out than not 🤔 The changes in this PR allows passing an option defined by the policy type whether it should do this filtering or not, with the DAG defaulting to always perform this filtering unless otherwise specified because of this majority (can default to not filter if people prefer).

Otherwise I think completely removing the checks and doing a refactor to perform this at each controller level elsewhere seems a bit out of scope for this PR

@eguzki
Copy link
Contributor

eguzki commented Jul 3, 2024

I am a bit out of context. But I will try to contribute a bit, hopefully not confusing things even more.

i don't think any policy should be ignoring events, they should all go down to the policy controller and it decides what to do.

I would not state like that. Policy controllers setup watchers, mappers and predicates to catch only relevant events for the resources (or subresources) they are reconciling. Ignoring not relevant events to save NO-OP reconcilliation triggers.

Any assumption made about only caring about gateways with a status "Programmed: true" is wrong anyway as we have at least one policy of our own that does care about Gateways not programmed (TLSPolicy) , and other future polices may do also. So could we just remove these checks?

I agree, but @KevFan implemented in a way you can opt out.

The changes in this PR allows passing an option defined by the policy type whether it should do this filtering or not, with the DAG defaulting to always perform this filtering unless otherwise specified because of this majority (can default to not filter if people prefer).

Have you considered having the topology component always add all the gateways to the graph and then have the topology index layer (the component wrapping the topology component), a new method called GetProgrammedGateways() that uses an precomputed index of programmed gateways?

Yet another option would be that each controller decides which gateways are feeded into the topology. Nothing stops you from doing a pre-filtering step to filter out not programmed gateways and feed the topology with only programmed gateways.

Between, your implemented option, the index layer and the pre-filtering approach, I would pick the index layer approach. But all three are good enough for me. The reason to pick the index layer is because it is somewhat weird to feed the topology with N gateways and when I requests gateways out of it I get fewer (or even none) instances of gateways.

@KevFan
Copy link
Contributor Author

KevFan commented Jul 4, 2024

Between, your implemented option, the index layer and the pre-filtering approach, I would pick the index layer approach. But all three are good enough for me. The reason to pick the index layer is because it is somewhat weird to feed the topology with N gateways and when I requests gateways out of it I get fewer (or even none) instances of gateways.

Index layer approach sounds good to me, however I prefer the current implementation since it works with both the FetchTargetRefObject and the DAG. The other approaches seems a bit out of scope for this PR at least that focuses on mainly the TLSPolicy. I'd rather opt for a follow up PR for this if we want it since it has a greater affect on all the areas that uses the DAG 🤔

Copy link
Contributor

@dlaw4608 dlaw4608 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verification Steps worked as expected, The TLS policy is created and enforced correctly.

@KevFan
Copy link
Contributor Author

KevFan commented Jul 8, 2024

@mikenairn are you okay with the changes otherwise or is there anything else outstanding? Would appreciate a review from you or someone more familiar with the TLSPolicy to ensure the changes makes sense 🤔

Copy link
Member

@mikenairn mikenairn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikenairn are you okay with the changes otherwise or is there anything else outstanding? Would appreciate a review from you or someone more familiar with the TLSPolicy to ensure the changes makes sense 🤔

Left a few comments/questions, but generally i think it looks fine. My initial questions were really around the need for the Programmed status checks at all, but if they are deemed necessary, then the approach here makes sense.

Have we tried it when cert manager isn't installed though?

api/v1alpha1/dnspolicy_types.go Show resolved Hide resolved
pkg/library/reconcilers/fetcher.go Show resolved Hide resolved
tests/common/tlspolicy/tlspolicy_controller_test.go Outdated Show resolved Hide resolved
pkg/library/mappers/gateway.go Show resolved Hide resolved
controllers/tlspolicy_controller.go Show resolved Hide resolved
@KevFan
Copy link
Contributor Author

KevFan commented Jul 9, 2024

Have we tried it when cert manager isn't installed though?

I've tried it when cert manager isn't installed with the following steps:

  • Deploy cluster
make local-env-setup
  • Delete any of the cert manager CRDs
kubectl delete crd certificates.cert-manager.io
kubectl delete crd clusterissuers.cert-manager.io
kubectl delete crd issuers.cert-manager.io
  • CRDs can be re-installed using:
make install-cert-manager
  • Run operater
make run
  • TLS Policy should be disabled with a similar log line to:
2024-07-09T11:12:02+01:00       ERROR   kuadrant-operator.tlspolicy     CertManager CRD was not installed       {"group": "cert-manager.io", "kind": "Certificate", "version": "v1"}
github.com/kuadrant/kuadrant-operator/pkg/library/gatewayapi.IsCertManagerInstalled
        /Users/chfan/dev/kuadrant-operator/pkg/library/gatewayapi/utils.go:162
github.com/kuadrant/kuadrant-operator/controllers.(*TLSPolicyReconciler).SetupWithManager
        /Users/chfan/dev/kuadrant-operator/controllers/tlspolicy_controller.go:205
main.main
        /Users/chfan/dev/kuadrant-operator/main.go:212
runtime.main
        /Users/chfan/sdk/go1.21.10/src/runtime/proc.go:267
2024-07-09T11:12:02+01:00       INFO    kuadrant-operator.tlspolicy     TLSPolicy controller disabled. CertManager was not found

@KevFan KevFan requested a review from mikenairn July 10, 2024 14:18
@KevFan KevFan merged commit f5116f2 into Kuadrant:main Jul 10, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working kind/enhancement New feature or request size/medium
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TLSPolicy enforced status is wrongly true when secret is missing
5 participants