Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconcile out of order gateway resource creation. #638

Merged
merged 17 commits into from
Jun 26, 2024

Conversation

Boomatang
Copy link
Contributor

@Boomatang Boomatang commented May 10, 2024

closes: #614

This address the issue by selecting all the AuthPolicies in the namespace of the Gateway CR that have ACCEPTED: false and REASON: TargetNotFound. With this list of resources the reconcile loop is triggered. In the diagram below this PR adds the "missing steps".
diagram

Verification Steps

  • Step up an instance of kuadrant and the toystore.
make local-setup
kubectl apply -f examples/toystore/kuadrant.yaml
kubectl apply -f examples/toystore/toystore.yaml
  • Create the AuthPolicies
echo "
apiVersion: kuadrant.io/v1beta2
kind: AuthPolicy
metadata:
  name: gw-1
  namespace: istio-system
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: gw-1
  rules:
    authentication:
      "api-key-users":
        apiKey:
          selector:
            matchLabels:
              app: toystore
          allNamespaces: true
        credentials:
          authorizationHeader:
            prefix: APIKEY
    response:
      success:
        dynamicMetadata:
          "identity":
            json:
              properties:
                "userid":
                  selector: auth.identity.metadata.annotations.secret\.kuadrant\.io/user-id
---
apiVersion: kuadrant.io/v1beta2
kind: AuthPolicy
metadata:
  name: toystore
  namespace: default
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: toystore
  rules:
    authentication:
      "api-key-users":
        apiKey:
          selector:
            matchLabels:
              app: toystore
          allNamespaces: true
        credentials:
          authorizationHeader:
            prefix: TOYSTORE
    response:
      success:
        dynamicMetadata:
          "identity":
            json:
              properties:
                "userid":
                  selector: auth.identity.metadata.annotations.secret\.kuadrant\.io/user-id
""" | kubectl apply -f -
  • check the status of the AuthPolices, both should be Accepted: False
  • deploy the gateway
echo "
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: gw-1
  namespace: istio-system
  annotations:
    kuadrant.io/namespace: kuadrant-system
    networking.istio.io/service-type: ClusterIP
spec:
  gatewayClassName: istio
  listeners:
    - name: apis
      port: 80
      protocol: HTTP
      hostname: '*.io'
      allowedRoutes:
        namespaces:
          from: All
" | kubectl apply -f -
- Expect AuthPolicy gw-1 to change its status to `Accepted: True, Enforced: False`. Policy is not enforced as there is no routes attached to the gateway.
- Create route
```sh
echo "
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: toystore
  namespace: default
spec:
  parentRefs:
  - name: gw-1
    namespace: istio-system
  hostnames:
  - api.toystore.io
  rules:
  - matches:
    - path:
        type: Exact
        value: "/toy"
      method: GET
    backendRefs:
    - name: toystore
      port: 80
" | kubectl apply -f -
  • expected AuthPolicy toystore to change its status to Accepted: True, Enforced: True

@Boomatang Boomatang changed the title WIP: Reconcile out of order gateway resource creation. Reconcile out of order gateway resource creation. May 30, 2024
@Boomatang Boomatang marked this pull request as ready for review May 30, 2024 15:20
@Boomatang Boomatang requested a review from a team as a code owner May 30, 2024 15:20
@Boomatang
Copy link
Contributor Author

The failing integration tests seems to be some what of a flaky timing issue now. When the test has a debugger attached 30% of the time the test will pass. So far I have tracked it down to

func ComputeGatewayDiffs(ctx context.Context, k8sClient client.Reader, policy, targetNetworkObject client.Object) (*GatewayDiffs, error) {
returning out-of-date results. I have yet to find why it is out-of-date.

@Boomatang Boomatang added the kind/bug Something isn't working label May 30, 2024
Copy link
Contributor

@guicassolato guicassolato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR, @Boomatang. Replacing the use of the back-reference annotations for the DAG is in the right direction IMO.

Left a few comments for possible enhancements.

Comment on lines 8 to 9
"k8s.io/apimachinery/pkg/runtime/schema"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"k8s.io/apimachinery/pkg/runtime/schema"
"k8s.io/apimachinery/pkg/runtime/schema"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what change you are looking for here. Can you clarify?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to remove the empty space and group the import along with the other third-party deps.

@@ -31,7 +32,9 @@ func (r *TLSPolicyReconciler) reconcileCertificates(ctx context.Context, tlsPoli
}

// Reconcile Certificates for each gateway directly referred by the policy (existing and new)
for _, gw := range append(gwDiffObj.GatewaysWithValidPolicyRef, gwDiffObj.GatewaysMissingPolicyRef...) {
gwList := append(gwDiffObj.GatewaysWithValidPolicyRef, gwDiffObj.GatewaysMissingPolicyRef...)
time.Sleep(250 * time.Millisecond) // Sleep required to make "should delete tls certificate when listener is removed" integration test pass.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we removing this? We shouldn't modify the implementation to make a test pass. Most certainly not by adding a sleep.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I do agree with you that we should not be adding the sleep, after two days of probing around I could not find any thing better than this. The whole thing has something fishy about it. If you flip the list creation after the sleep the test will fail. It makes no sense and at some point the work needs to be done. Personal I would prefer to have the sleep in the code over the having the bug that the PR fixes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Boomatang Do you have more details on why this test in particular is an issue? Does the test fail consistently? How does it fail (What error)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure @mikenairn , the failing test is TLSPolicy controller with multiple https listener [It] should delete tls certificate when listener is removed and the test fails with:

  Expected
      <*errors.errorString | 0xc0051a94b0>: 
      expected 2 certificates, found: 3
      {
          s: "expected 2 certificates, found: 3",
      }
  to be nil

It fails consistently if the time.Sleep is removed.

Copy link
Contributor

@KevFan KevFan Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a look into this. This seems to be something to do more with the test scenario. The created test gateway status has the following:

conditions:
  - lastTransitionTime: "1970-01-01T00:00:00Z"
    message: Waiting for controller
    reason: Pending
    status: Unknown
    type: Accepted
  - lastTransitionTime: "1970-01-01T00:00:00Z"
    message: Waiting for controller
    reason: Pending
    status: Unknown
    type: Programmed

And it never goes to a Programmed: True state.

The topology filters outs Gateways not in the Programmed: true state at:
https://github.com/Boomatang/kuadrant-operator/blob/dc5f92b6520bc9ee03740457ba512ce65381a5dd/pkg/library/gatewayapi/topology.go#L202-L206

This causes the gateway change event at https://github.com/Boomatang/kuadrant-operator/blob/fix/GH614/controllers/tlspolicy_controller_test.go#L410-L413 to never be reconciled by the TLS Policy controller as the request was filtered out at the event mapper in
https://github.com/Boomatang/kuadrant-operator/blob/dc5f92b6520bc9ee03740457ba512ce65381a5dd/pkg/library/mappers/gateway.go#L59-L63

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i cant really work out how any of the TLS tests are working tbh as it appears we removed that some time ago.

The tests still work because this is checking for "False", which the test Gateway doesn't have because it's "Unknown" since no controller has picked it up (As expected). This however is checking for "True", so @KevFan has a point that might be an issue since it won't trigger based on the gateway event if it's going through that filter. I'd say these two things should at least be doing the same check, but there is a regression that was introduced when this was changed since a Gateway won't become ready (Programmed: true) if there is a tls section and the secret is missing, the thing the TLSPolicy is ensuring.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to what @mikenairn said, so while the test is passing for now I think it will be flaky at best (at least for me). @Boomatang Let me know if you want to fix the regression issue as part of this PR, if not, I can look into that separately since I introduced that regression issue 🧑‍💻

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this it's self is fixing a bug and has being opened for awhile, I think it would be best to try get to the bottom of this flakiness in a second PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this got merged without fixing the flaky test it was introducing https://github.com/Kuadrant/kuadrant-operator/actions/runs/9699428897/job/26768361658?pr=716

Any plans to fix that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan on addressing it as part of #715

gateway, ok := obj.(*gatewayapiv1.Gateway)
if !ok {
logger.Info("cannot map gateway related event to kuadrant policy", "error", fmt.Sprintf("%T is not a *gatewayapiv1beta1.Gateway", obj))
logger.V(1).Info(fmt.Sprintf("%T is not type gateway, unable to map policies to gateway", obj))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching to debug level? So we want it to abort silently now? Is there even a case where we expect a gateway mapper to receive an object that is not a gateway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, I did change some of them. I do however think it is still better to keep the error field in the log statement.

Comment on lines 46 to 96
policyList := &unstructured.UnstructuredList{}
policyList.SetAPIVersion(policyGVK.Version)
policyList.SetKind(policyGVK.Kind)
if err := m.opts.Client.List(ctx, policyList, client.InNamespace(obj.GetNamespace())); err != nil {
logger.V(1).Info("unable to list UnstructuredList of policies, %T", "policyGVl", policyGVK)
return []reconcile.Request{}
}

requests := make([]reconcile.Request, 0)
var policies []kuadrantgatewayapi.Policy
if err := policyList.EachListItem(func(obj runtime.Object) error {
objBytes, err := json.Marshal(obj)
if err != nil {
return err
}

for _, policyKey := range kuadrant.BackReferencesFromObject(gateway, policyKind) {
logger.V(1).Info("kuadrant policy possibly affected by the gateway related event found", policyKind.Kind(), policyKey)
requests = append(requests, reconcile.Request{NamespacedName: policyKey})
switch obj.GetObjectKind().GroupVersionKind().Kind {
case "AuthPolicy":
policy := &api.AuthPolicy{}
err = json.Unmarshal(objBytes, policy)
if err != nil {
return err
}
policies = append(policies, policy)
case "DNSPolicy":
policy := &v1alpha1.DNSPolicy{}
err = json.Unmarshal(objBytes, policy)
if err != nil {
return err
}
policies = append(policies, policy)
case "TLSPolicy":
policy := &v1alpha1.TLSPolicy{}
err = json.Unmarshal(objBytes, policy)
if err != nil {
return err
}
policies = append(policies, policy)
case "RateLimitPolicy":
policy := &api.RateLimitPolicy{}
err = json.Unmarshal(objBytes, policy)
if err != nil {
return err
}
policies = append(policies, policy)
default:
return fmt.Errorf("unknown policy kind: %s", obj.GetObjectKind().GroupVersionKind().Kind)
}
return nil
}); err != nil {
logger.V(1).Error(err, "unable to map unstructured List of policies")
return []reconcile.Request{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could leverage an interface here instead and thus avoid the strict switch clause for known kinds only. kuadrantgatewayapi.Policy seem a good candidate.

type Policy interface {
	…
	List(context.Context, client.Client, string) []Policy
}

This would allow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can check this again, but I am fairly sure I ran into issues with unmarshaling when trying to use the Policy type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change has being made. It works well and is cleaner.

logger.Info("cannot list httproutes", "error", err)
continue
}
policyList := &unstructured.UnstructuredList{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. No need to use UnstructuredList if the policy kinds themselves can use a client to list and return []kuadrantgatewayapi.Policy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

typeField dag.Field = dag.Field("type")
gatewayLabel dag.NodeLabel = dag.NodeLabel("gateway")
httprouteLabel dag.NodeLabel = dag.NodeLabel("httproute")
HTTPRouteGatewayParentField = ".metadata.parentRefs.gateway"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also declared at:

HTTPRouteGatewayParentField = ".metadata.parentRefs.gateway"

Better having it in one place only IMO. Here, in the gatewayapi package, seems adequate. Maybe remove it from controllers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored

Comment on lines 137 to 150
// This block is required when a HTTProute has being deleted
var policy kuadrant.Referrer
switch policyGVK.Kind {
case "AuthPolicy":
policy = &api.AuthPolicy{}
case "DNSPolicy":
policy = &v1alpha1.DNSPolicy{}
case "TLSPolicy":
policy = &v1alpha1.TLSPolicy{}
case "RateLimitPolicy":
policy = &api.RateLimitPolicy{}
default:
return requests
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We didn't need this before when passing an object that implements Referrer as argument. If we can avoid strict switch clauses, I personally think it's desirable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will need to double check this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

@Boomatang Boomatang force-pushed the fix/GH614 branch 2 times, most recently from 702f77d to ab9574b Compare June 12, 2024 13:52
@Boomatang Boomatang requested a review from guicassolato June 17, 2024 09:59
Copy link
Contributor

@eguzki eguzki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you are trying to do. Looks good to me for now.

However, I believe this is not the best approach. Adding "scan all" calls to the mappers looks like something we want to avoid. The mappers are called multiple times, sometimes generating reconciliation events and sometimes not.

The approach I would follow is the one followed by the status controllers. I think this is the right one. In short: a controller only responsible for the status conditions of the policies. That controller will be getting reconciliation events on Gateway resource, hence

return ctrl.NewControllerManagedBy(mgr).    
    For(&gatewayapiv1.Gateway{}).           

The controller will watch for policies, having a mapper to the gateway scaling up in the Gateway API topology (no need to create topology in the mappers)

The controller will watch for httproutes, having a mapper to the gateway scaling up in the Gateway API topology (no need to create topology in the mappers).

That controller will be able to catch the event of a newly created route. The cluster has one policy with the status condition not accepted. The controlller will generate a DAG topology and fill the status with the latest state of the cluster.

Besides, I propose to create a new index for the topology: all orphaned (or broken targetRefs) policies of a cluster . Then the controller will reconcile the status conditions of all the policies being aware of the latest state of the cluster. That covers use case of a route being deleted after the policy targeting at that is in accepted state. Currently, we cover that use case with backreferences. But backreferences have issues and when we get rid of them, this use case will be an issue. An issue that the proposed approach covers.

@@ -178,6 +180,21 @@ func (p *DNSPolicy) GetStatus() kuadrantgatewayapi.PolicyStatus {

func (p *DNSPolicy) Kind() string { return p.TypeMeta.Kind }

func (p *DNSPolicy) List(ctx context.Context, c client.Client, namespace string) []kuadrantgatewayapi.Policy {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not fuction instead of method?

func DNSPolicyList(ctx context.Context, c client.Client, namespace string) []kuadrantgatewayapi.Policy {
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of where it is being used. The List method was added to the Policy Interface as it required in the mapping functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bad smell is that you are not using the receiver, thus looks like a dog, walks like a dog and barks like a dog, but it is a cat. (dog => function, cat => method)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the previous iterations of this PR did not have this. Instead it listed the policies from the cluster based on the type passed in. There was a switch within the mapper function. It was suggested that this was the cleaner approach. I understand why you don't like it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method could have been name sth like ListAllOfKind.

api/v1beta2/ratelimitpolicy_types.go Outdated Show resolved Hide resolved
@Boomatang
Copy link
Contributor Author

@eguzki Thanks for the comments. On the longer comment about the "scan all". I agree I don't like that approach and in the initial iteration of this PR that was not what happened. Instead it went looking for policy that were in a bad state and rerun the reconcile for these policies. If I had know about the indexer functions at that time, I would have made one.

The use of the DAG came from a suggestion on the PR. Which also makes sense as it is meant to be our preferred way of doing things now. The fact it is not an in-memory data store and every time we want to use it we have to do the "scan all", is a problem.

Your mention of the status controller has me a bit confused. Are you asking to write a new controller or rewrite how all the controller managers are initialized? Is it something you want changed in this PR or is in something that can be progressed later?

@eguzki
Copy link
Contributor

eguzki commented Jun 20, 2024

Your mention of the status controller has me a bit confused. Are you asking to write a new controller or rewrite how all the controller managers are initialized? Is it something you want changed in this PR or is in something that can be progressed later?

nothing to be changed in this PR (unless you wanted to). It is what I consider the best way to tackle the issue being addressed here.

@Boomatang Boomatang self-assigned this Jun 21, 2024
Copy link
Contributor

@KevFan KevFan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified, looks good to me anyway 👍

Copy link
Contributor

@guicassolato guicassolato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@Boomatang Boomatang merged commit ce58882 into Kuadrant:main Jun 26, 2024
18 checks passed
@Boomatang Boomatang deleted the fix/GH614 branch June 26, 2024 10:43
@eguzki
Copy link
Contributor

eguzki commented Jun 28, 2024

I now think that the approach followed here is, indeed, the right one.

@eguzki eguzki mentioned this pull request Jul 1, 2024
12 tasks
dlaw4608 pushed a commit to dlaw4608/kuadrant-operator that referenced this pull request Jul 3, 2024
Reconcile out of order gateway resource creation. (Kuadrant#638)

* On gateway creation if no backrefs are found reconcile AuthPolices with a status of ACCEPTED: false and REASON: TargetNotFound.

* using dag for reconcile trigger.

* fix unit tests

* Tidy up of some WIP code

* Fix the out-of-order creation issue for authPolices.

* Fix issues raised from the CI tests

* Rebase updates

* Fix integration test, sleep required for some reason.

* Refactor to remove the need for switches.

* Update unit tests to match refactor

* Remove deprecated `FromInt`

* Updates after rebase

* small update to make lists.

* rebase format fix

* Revert time.sleep

* Add logging for error message

* Address comment request

chore: Remove DNSRecord ownerID generation (Kuadrant#716)

The `ownerID` field on DNSRecord is now optional, the dns operator will
now ensure a valid unique ownerID for all DNSRecords if none is
explicitly set.

 Add TargetRef to AuthPolicy and Fix Formatting and Integration Tests

- Added TargetRef to AuthPolicy to specify the target reference.
- Replaced the gateway object with the gateway wrapped inside the gatewayWrapper.
- Fixed various errors and improved code formatting.
- Resolved issues with Istio integration tests.

This commit consolidates multiple changes made during the development process, improving the stability and functionality of the AuthPolicy controller and related tests.

Add TargetRef to AuthPolicy and Fix Formatting and Integration Tests  - Added TargetRef to AuthPolicy to specify the target reference.
    - Replaced the gateway object with the gateway wrapped inside the gatewayWrapper.
    - Fixed various errors and improved code formatting.
    - Resolved issues with Istio integration tests.

    This commit consolidates multiple changes made during the development process, improving the stability and functionality of the AuthPolicy controller and related tests.

understanding worflow github actions example

updated kind cluster version from v0.22.0 to v0.23.0, along with node image from v1.29.2 to v1.30.0

Removed unnecessary file
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Order of creation
5 participants