Investigate options to speed up reconciliation for a large number of GWs, Listeners and policies #1085

trepel · 2024-12-18T14:16:06Z

Overview

Recently a scale test has been implemented:
https://github.com/Kuadrant/testsuite/tree/main/scale_test

I ran the scale test a few times. It took the Kuadrant operator quite some time to reconcile all the policies. Many AuthPolicies and RLPs had no status at all for quite some time. Once they did get a status, it was often:
'AuthPolicy waiting for the following components to sync: [AuthConfig (0cbc22a687a9ff2a57c54007e8ad9b6bc17de3744144196b9b8286fb1593f495)]'
and
'RateLimitPolicy waiting for the following components to sync: [Limitador]'

Everything got reconciled successfully and policies got enforced eventually, but it took quite some time:
1 GW 16 Listeners -> 16s to get status for all policies
1 GW 32 Listeners -> 120s to get status for all policies
1 GW 48 Listeners -> 7 min to get status for all policies
1 GW 63 Listeners -> 30 min to get status for all policies

The operator log contained a lot of entries complaining about invalid paths, so I made each HTTPRoute target a specific Listener rather than the whole Gateway. This made the results much better:

1 GW 32 Listeners -> 18s to get status for all policies
1 GW 48 Listeners -> 60s to get status for all policies
1 GW 63 Listeners -> 120s to get status for all policies
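
For readers unfamiliar with the change, here is a minimal sketch of what "targeting a specific Listener" means at the manifest level: the route's `parentRefs` entry gets a `sectionName` naming one Listener of the Gateway. The helper `http_route` and all resource names, hostnames and ports below are hypothetical and are not taken from the scale test; only the `parentRefs[].sectionName` field comes from the Gateway API.

```python
def http_route(name, gateway, namespace, hostname, backend, section_name=None):
    """Build an HTTPRoute manifest as a plain dict (illustrative helper).

    With section_name set, the route attaches to that single Listener.
    Without it, the route attaches to the Gateway as a whole.
    """
    parent_ref = {
        "group": "gateway.networking.k8s.io",
        "kind": "Gateway",
        "name": gateway,
        "namespace": namespace,
    }
    if section_name is not None:
        # For a Gateway parent, sectionName is the name of a Listener.
        parent_ref["sectionName"] = section_name
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "parentRefs": [parent_ref],
            "hostnames": [hostname],
            # Backend name and port are placeholders.
            "rules": [{"backendRefs": [{"name": backend, "port": 8080}]}],
        },
    }


# One route per Listener, each pinned to "its" Listener by name.
routes = [
    http_route(
        name=f"route-{i}",
        gateway="gw-1",
        namespace="scale-test",
        hostname=f"api-{i}.example.com",
        backend="httpbin",
        section_name=f"api-{i}",
    )
    for i in range(1, 4)
]
```

Calling `http_route(...)` without `section_name` produces the original variant, where the route targets the whole Gateway.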

I tried with 2 GWs as well:
2 GW 16 Listeners -> 18s to get status for all policies
2 GW 32 Listeners -> 76s to get status for all policies

However, this was still too much:
2 GW 63 Listeners -> 16 min to get status for all policies
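
To make the growth easier to compare, the numbers above can be normalized. This is only a rough sketch: it assumes the Listener counts are per Gateway and that all 2 GW runs used the Listener-targeted HTTPRoutes. The dictionary and function names are made up for illustration.

```python
# (gateways, listeners per gateway) -> seconds until all policies had a status
whole_gateway = {(1, 16): 16, (1, 32): 120, (1, 48): 7 * 60, (1, 63): 30 * 60}
per_listener_routes = {
    (1, 32): 18,
    (1, 48): 60,
    (1, 63): 120,
    (2, 16): 18,
    (2, 32): 76,
    (2, 63): 16 * 60,
}


def seconds_per_listener(results):
    """Average time per Listener; a flat value would mean linear scaling."""
    return {
        (gws, listeners): secs / (gws * listeners)
        for (gws, listeners), secs in results.items()
    }


def speedup(before, after):
    """Ratio before/after for the configurations measured in both runs."""
    return {key: before[key] / after[key] for key in before if key in after}


for key, value in seconds_per_listener(whole_gateway).items():
    print("whole gateway", key, round(value, 2), "s per Listener")
for key, value in seconds_per_listener(per_listener_routes).items():
    print("per listener ", key, round(value, 2), "s per Listener")
for key, value in speedup(whole_gateway, per_listener_routes).items():
    print("speedup      ", key, round(value, 1), "x")
```

In both variants the time per Listener rises with the Listener count, so the cost grows faster than linearly. For a single Gateway, the Listener-targeted routes are roughly 7x faster at 32 and 48 Listeners and 15x faster at 63.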

Initial Investigation

Be aware that certificate generation and DNS record creation might affect the results. It takes some time for certificates to get created (the scale test uses a self-signed cluster issuer), and it also takes time for the cloud provider to create that many DNS records.

It seems reasonable that the wasm config (only one per GW) is a contention point. That said, it should eventually get there.
There are repeated log entries of “failed to update the object has been modified; please apply your changes to the latest version” in the Kuadrant operator pod log.
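
That message is the one the Kubernetes API server returns when an update carries a stale `resourceVersion`, i.e. an optimistic-concurrency conflict. The toy sketch below shows why many writers updating one shared object keep hitting it, and the usual read-modify-write retry loop. It is not the Kuadrant operator's code: `Store`, `Conflict` and `update_with_retry` are hypothetical stand-ins, and the store is an in-memory simulation rather than a real API client.

```python
import copy


class Conflict(Exception):
    """Stands in for the API server's 409 Conflict response."""


class Store:
    """Toy single-object store with a resourceVersion-style counter."""

    def __init__(self, data):
        self.version = 1
        self.data = copy.deepcopy(data)

    def get(self):
        return self.version, copy.deepcopy(self.data)

    def update(self, version, data):
        # Reject writes that were based on an outdated read.
        if version != self.version:
            raise Conflict(
                "the object has been modified; "
                "please apply your changes to the latest version"
            )
        self.data = copy.deepcopy(data)
        self.version += 1


def update_with_retry(store, mutate, attempts=5):
    """Re-read and re-apply the change until the write is accepted."""
    for attempt in range(1, attempts + 1):
        version, data = store.get()
        mutate(data)
        try:
            store.update(version, data)
            return attempt
        except Conflict:
            continue
    raise Conflict(f"gave up after {attempts} attempts")


store = Store({"policies": []})
interfering = [True]


def add_policy(data):
    # Simulate another writer slipping in during the first attempt only.
    if interfering:
        interfering.pop()
        version, other = store.get()
        other["policies"].append("other-writer")
        store.update(version, other)
    data["policies"].append("my-policy")


attempts_used = update_with_retry(store, add_policy)
print(attempts_used, store.data["policies"])
```

The more writers target the same object, the more often a read goes stale before the write lands, which would be consistent with the single wasm config per Gateway being a contention point. In Go controllers the same pattern is commonly written with client-go's `retry.RetryOnConflict`; whether the operator uses it is not something this issue establishes.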

Entries like "failed to create SOMETHING, SOMETHING already exists" also appeared in the Kuadrant operator pod log (SOMETHING typically being Certificate or AuthConfig). I am not sure whether this indicates a problem.

Questions / Investigation required

Does it make sense that having many invalid paths is so expensive?
What can be done to improve on those 16 minutes? 2 Gateways with 63 Listeners each are not particularly high numbers.

Steps to reproduce

Basically, follow the README of the scale test:
https://github.com/Kuadrant/testsuite/tree/main/scale_test
I used OCP on AWS (6 worker nodes) with DNS set up on the same AWS account.
This was done against Kuadrant v1.0.1 on OCP v4.17.7.

I believe there are enough details here; for even more details see (Red Hat only, sorry):
https://docs.google.com/document/d/1ATH2aZJ7-qlYTV3jF_rZduMC1MTPKoWCD-N4LmmcaMA/edit?tab=t.0

trepel · 2024-12-18T15:15:23Z

PR with added sectionName:
Kuadrant/testsuite#610

trepel added this to Kuadrant Dec 18, 2024
