-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reconcile out of order gateway resource creation. #638
Conversation
The failing integration tests seems to be some what of a flaky timing issue now. When the test has a debugger attached 30% of the time the test will pass. So far I have tracked it down to
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR, @Boomatang. Replacing the use of the back-reference annotations for the DAG is in the right direction IMO.
Left a few comments for possible enhancements.
controllers/authpolicy_controller.go
Outdated
"k8s.io/apimachinery/pkg/runtime/schema" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"k8s.io/apimachinery/pkg/runtime/schema" | |
"k8s.io/apimachinery/pkg/runtime/schema" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what change you are looking for here. Can you clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was to remove the empty space and group the import along with the other third-party deps.
@@ -31,7 +32,9 @@ func (r *TLSPolicyReconciler) reconcileCertificates(ctx context.Context, tlsPoli | |||
} | |||
|
|||
// Reconcile Certificates for each gateway directly referred by the policy (existing and new) | |||
for _, gw := range append(gwDiffObj.GatewaysWithValidPolicyRef, gwDiffObj.GatewaysMissingPolicyRef...) { | |||
gwList := append(gwDiffObj.GatewaysWithValidPolicyRef, gwDiffObj.GatewaysMissingPolicyRef...) | |||
time.Sleep(250 * time.Millisecond) // Sleep required to make "should delete tls certificate when listener is removed" integration test pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we removing this? We shouldn't modify the implementation to make a test pass. Most certainly not by adding a sleep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I do agree with you that we should not be adding the sleep, after two days of probing around I could not find any thing better than this. The whole thing has something fishy about it. If you flip the list creation after the sleep the test will fail. It makes no sense and at some point the work needs to be done. Personal I would prefer to have the sleep in the code over the having the bug that the PR fixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Boomatang Do you have more details on why this test in particular is an issue? Does the test fail consistently? How does it fail (What error)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure @mikenairn , the failing test is TLSPolicy controller with multiple https listener [It] should delete tls certificate when listener is removed
and the test fails with:
Expected
<*errors.errorString | 0xc0051a94b0>:
expected 2 certificates, found: 3
{
s: "expected 2 certificates, found: 3",
}
to be nil
It fails consistently if the time.Sleep
is removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a look into this. This seems to be something to do more with the test scenario. The created test gateway status has the following:
conditions:
- lastTransitionTime: "1970-01-01T00:00:00Z"
message: Waiting for controller
reason: Pending
status: Unknown
type: Accepted
- lastTransitionTime: "1970-01-01T00:00:00Z"
message: Waiting for controller
reason: Pending
status: Unknown
type: Programmed
And it never goes to a Programmed: True
state.
The topology filters outs Gateways not in the Programmed: true
state at:
https://github.com/Boomatang/kuadrant-operator/blob/dc5f92b6520bc9ee03740457ba512ce65381a5dd/pkg/library/gatewayapi/topology.go#L202-L206
This causes the gateway change event at https://github.com/Boomatang/kuadrant-operator/blob/fix/GH614/controllers/tlspolicy_controller_test.go#L410-L413 to never be reconciled by the TLS Policy controller as the request was filtered out at the event mapper in
https://github.com/Boomatang/kuadrant-operator/blob/dc5f92b6520bc9ee03740457ba512ce65381a5dd/pkg/library/mappers/gateway.go#L59-L63
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i cant really work out how any of the TLS tests are working tbh as it appears we removed that some time ago.
The tests still work because this is checking for "False", which the test Gateway doesn't have because it's "Unknown" since no controller has picked it up (As expected). This however is checking for "True", so @KevFan has a point that might be an issue since it won't trigger based on the gateway event if it's going through that filter. I'd say these two things should at least be doing the same check, but there is a regression that was introduced when this was changed since a Gateway won't become ready (Programmed: true) if there is a tls section and the secret is missing, the thing the TLSPolicy is ensuring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to what @mikenairn said, so while the test is passing for now I think it will be flaky at best (at least for me). @Boomatang Let me know if you want to fix the regression issue as part of this PR, if not, I can look into that separately since I introduced that regression issue 🧑💻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this it's self is fixing a bug and has being opened for awhile, I think it would be best to try get to the bottom of this flakiness in a second PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this got merged without fixing the flaky test it was introducing https://github.com/Kuadrant/kuadrant-operator/actions/runs/9699428897/job/26768361658?pr=716
Any plans to fix that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan on addressing it as part of #715
pkg/library/mappers/gateway.go
Outdated
gateway, ok := obj.(*gatewayapiv1.Gateway) | ||
if !ok { | ||
logger.Info("cannot map gateway related event to kuadrant policy", "error", fmt.Sprintf("%T is not a *gatewayapiv1beta1.Gateway", obj)) | ||
logger.V(1).Info(fmt.Sprintf("%T is not type gateway, unable to map policies to gateway", obj)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switching to debug level? So we want it to abort silently now? Is there even a case where we expect a gateway mapper to receive an object that is not a gateway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair, I did change some of them. I do however think it is still better to keep the error field in the log statement.
pkg/library/mappers/gateway.go
Outdated
policyList := &unstructured.UnstructuredList{} | ||
policyList.SetAPIVersion(policyGVK.Version) | ||
policyList.SetKind(policyGVK.Kind) | ||
if err := m.opts.Client.List(ctx, policyList, client.InNamespace(obj.GetNamespace())); err != nil { | ||
logger.V(1).Info("unable to list UnstructuredList of policies, %T", "policyGVl", policyGVK) | ||
return []reconcile.Request{} | ||
} | ||
|
||
requests := make([]reconcile.Request, 0) | ||
var policies []kuadrantgatewayapi.Policy | ||
if err := policyList.EachListItem(func(obj runtime.Object) error { | ||
objBytes, err := json.Marshal(obj) | ||
if err != nil { | ||
return err | ||
} | ||
|
||
for _, policyKey := range kuadrant.BackReferencesFromObject(gateway, policyKind) { | ||
logger.V(1).Info("kuadrant policy possibly affected by the gateway related event found", policyKind.Kind(), policyKey) | ||
requests = append(requests, reconcile.Request{NamespacedName: policyKey}) | ||
switch obj.GetObjectKind().GroupVersionKind().Kind { | ||
case "AuthPolicy": | ||
policy := &api.AuthPolicy{} | ||
err = json.Unmarshal(objBytes, policy) | ||
if err != nil { | ||
return err | ||
} | ||
policies = append(policies, policy) | ||
case "DNSPolicy": | ||
policy := &v1alpha1.DNSPolicy{} | ||
err = json.Unmarshal(objBytes, policy) | ||
if err != nil { | ||
return err | ||
} | ||
policies = append(policies, policy) | ||
case "TLSPolicy": | ||
policy := &v1alpha1.TLSPolicy{} | ||
err = json.Unmarshal(objBytes, policy) | ||
if err != nil { | ||
return err | ||
} | ||
policies = append(policies, policy) | ||
case "RateLimitPolicy": | ||
policy := &api.RateLimitPolicy{} | ||
err = json.Unmarshal(objBytes, policy) | ||
if err != nil { | ||
return err | ||
} | ||
policies = append(policies, policy) | ||
default: | ||
return fmt.Errorf("unknown policy kind: %s", obj.GetObjectKind().GroupVersionKind().Kind) | ||
} | ||
return nil | ||
}); err != nil { | ||
logger.V(1).Error(err, "unable to map unstructured List of policies") | ||
return []reconcile.Request{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could leverage an interface here instead and thus avoid the strict switch clause for known kinds only. kuadrantgatewayapi.Policy
seem a good candidate.
type Policy interface {
…
List(context.Context, client.Client, string) []Policy
}
This would allow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can check this again, but I am fairly sure I ran into issues with unmarshaling when trying to use the Policy
type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change has being made. It works well and is cleaner.
pkg/library/mappers/httproute.go
Outdated
logger.Info("cannot list httproutes", "error", err) | ||
continue | ||
} | ||
policyList := &unstructured.UnstructuredList{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. No need to use UnstructuredList
if the policy kinds themselves can use a client to list and return []kuadrantgatewayapi.Policy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
pkg/library/gatewayapi/topology.go
Outdated
typeField dag.Field = dag.Field("type") | ||
gatewayLabel dag.NodeLabel = dag.NodeLabel("gateway") | ||
httprouteLabel dag.NodeLabel = dag.NodeLabel("httproute") | ||
HTTPRouteGatewayParentField = ".metadata.parentRefs.gateway" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also declared at:
HTTPRouteGatewayParentField = ".metadata.parentRefs.gateway" |
Better having it in one place only IMO. Here, in the gatewayapi
package, seems adequate. Maybe remove it from controllers
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored
pkg/library/mappers/httproute.go
Outdated
// This block is required when a HTTProute has being deleted | ||
var policy kuadrant.Referrer | ||
switch policyGVK.Kind { | ||
case "AuthPolicy": | ||
policy = &api.AuthPolicy{} | ||
case "DNSPolicy": | ||
policy = &v1alpha1.DNSPolicy{} | ||
case "TLSPolicy": | ||
policy = &v1alpha1.TLSPolicy{} | ||
case "RateLimitPolicy": | ||
policy = &api.RateLimitPolicy{} | ||
default: | ||
return requests | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We didn't need this before when passing an object that implements Referrer
as argument. If we can avoid strict switch clauses, I personally think it's desirable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will need to double check this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
702f77d
to
ab9574b
Compare
…es with a status of ACCEPTED: false and REASON: TargetNotFound.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you are trying to do. Looks good to me for now.
However, I believe this is not the best approach. Adding "scan all" calls to the mappers looks like something we want to avoid. The mappers are called multiple times, sometimes generating reconciliation events and sometimes not.
The approach I would follow is the one followed by the status controllers. I think this is the right one. In short: a controller only responsible for the status conditions of the policies. That controller will be getting reconciliation events on Gateway resource, hence
return ctrl.NewControllerManagedBy(mgr).
For(&gatewayapiv1.Gateway{}).
The controller will watch for policies, having a mapper to the gateway scaling up in the Gateway API topology (no need to create topology in the mappers)
The controller will watch for httproutes, having a mapper to the gateway scaling up in the Gateway API topology (no need to create topology in the mappers).
That controller will be able to catch the event of a newly created route. The cluster has one policy with the status condition not accepted. The controlller will generate a DAG topology and fill the status with the latest state of the cluster.
Besides, I propose to create a new index for the topology: all orphaned (or broken targetRefs) policies of a cluster . Then the controller will reconcile the status conditions of all the policies being aware of the latest state of the cluster. That covers use case of a route being deleted after the policy targeting at that is in accepted state. Currently, we cover that use case with backreferences. But backreferences have issues and when we get rid of them, this use case will be an issue. An issue that the proposed approach covers.
@@ -178,6 +180,21 @@ func (p *DNSPolicy) GetStatus() kuadrantgatewayapi.PolicyStatus { | |||
|
|||
func (p *DNSPolicy) Kind() string { return p.TypeMeta.Kind } | |||
|
|||
func (p *DNSPolicy) List(ctx context.Context, c client.Client, namespace string) []kuadrantgatewayapi.Policy { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not fuction instead of method?
func DNSPolicyList(ctx context.Context, c client.Client, namespace string) []kuadrantgatewayapi.Policy {
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of where it is being used. The List
method was added to the Policy Interface as it required in the mapping functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bad smell is that you are not using the receiver, thus looks like a dog, walks like a dog and barks like a dog, but it is a cat. (dog => function, cat => method)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the previous iterations of this PR did not have this. Instead it listed the policies from the cluster based on the type passed in. There was a switch within the mapper function. It was suggested that this was the cleaner approach. I understand why you don't like it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method could have been name sth like ListAllOfKind
.
@eguzki Thanks for the comments. On the longer comment about the "scan all". I agree I don't like that approach and in the initial iteration of this PR that was not what happened. Instead it went looking for policy that were in a bad state and rerun the reconcile for these policies. If I had know about the indexer functions at that time, I would have made one. The use of the DAG came from a suggestion on the PR. Which also makes sense as it is meant to be our preferred way of doing things now. The fact it is not an in-memory data store and every time we want to use it we have to do the "scan all", is a problem. Your mention of the status controller has me a bit confused. Are you asking to write a new controller or rewrite how all the controller managers are initialized? Is it something you want changed in this PR or is in something that can be progressed later? |
nothing to be changed in this PR (unless you wanted to). It is what I consider the best way to tackle the issue being addressed here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verified, looks good to me anyway 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I now think that the approach followed here is, indeed, the right one. |
Reconcile out of order gateway resource creation. (Kuadrant#638) * On gateway creation if no backrefs are found reconcile AuthPolices with a status of ACCEPTED: false and REASON: TargetNotFound. * using dag for reconcile trigger. * fix unit tests * Tidy up of some WIP code * Fix the out-of-order creation issue for authPolices. * Fix issues raised from the CI tests * Rebase updates * Fix integration test, sleep required for some reason. * Refactor to remove the need for switches. * Update unit tests to match refactor * Remove deprecated `FromInt` * Updates after rebase * small update to make lists. * rebase format fix * Revert time.sleep * Add logging for error message * Address comment request chore: Remove DNSRecord ownerID generation (Kuadrant#716) The `ownerID` field on DNSRecord is now optional, the dns operator will now ensure a valid unique ownerID for all DNSRecords if none is explicitly set. Add TargetRef to AuthPolicy and Fix Formatting and Integration Tests - Added TargetRef to AuthPolicy to specify the target reference. - Replaced the gateway object with the gateway wrapped inside the gatewayWrapper. - Fixed various errors and improved code formatting. - Resolved issues with Istio integration tests. This commit consolidates multiple changes made during the development process, improving the stability and functionality of the AuthPolicy controller and related tests. Add TargetRef to AuthPolicy and Fix Formatting and Integration Tests - Added TargetRef to AuthPolicy to specify the target reference. - Replaced the gateway object with the gateway wrapped inside the gatewayWrapper. - Fixed various errors and improved code formatting. - Resolved issues with Istio integration tests. This commit consolidates multiple changes made during the development process, improving the stability and functionality of the AuthPolicy controller and related tests. understanding worflow github actions example updated kind cluster version from v0.22.0 to v0.23.0, along with node image from v1.29.2 to v1.30.0 Removed unnecessary file
closes: #614
This address the issue by selecting all the AuthPolicies in the namespace of the Gateway CR that have
ACCEPTED: false
andREASON: TargetNotFound
. With this list of resources the reconcile loop is triggered. In the diagram below this PR adds the "missing steps".Verification Steps
Accepted: False
Accepted: True, Enforced: True