Investigate options to speed up reconciliation for a large number of GWs, Listeners and policies #1085

Open
trepel opened this issue Dec 18, 2024 · 1 comment

trepel commented Dec 18, 2024

Overview

Recently a scale test has been implemented:
https://github.com/Kuadrant/testsuite/tree/main/scale_test

I ran the scale test a few times. It took quite some time for the Kuadrant operator to reconcile all the policies. Many AuthPolicies and RLPs (RateLimitPolicies) had no status at all for quite some time. Once they did get a status, it was often:
'AuthPolicy waiting for the following components to sync: [AuthConfig (0cbc22a687a9ff2a57c54007e8ad9b6bc17de3744144196b9b8286fb1593f495)]'
and
'RateLimitPolicy waiting for the following components to sync: [Limitador]'

Everything got reconciled successfully and policies got enforced eventually, but it took quite some time:
1 GW 16 Listeners -> 16s to get status for all policies
1 GW 32 Listeners -> 120s to get status for all policies
1 GW 48 Listeners -> 7 min to get status for all policies
1 GW 63 Listeners -> 30 min to get status for all policies

In the operator log there were a lot of entries complaining about invalid paths. So I made the HTTPRoutes target a specific Listener (via sectionName) rather than the whole Gateway. This made the results much better:

1 GW 32 Listeners -> 18s to get status for all policies
1 GW 48 Listeners -> 60s to get status for all policies
1 GW 63 Listeners -> 120s to get status for all policies
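
To make the difference between the two setups concrete, here is a minimal sketch of an HTTPRoute built as a plain Python dict, once attached to the whole Gateway and once attached to a single Listener through `sectionName` in `parentRefs`. The names `route-1`, `gw-1`, `listener-1` and the `backend` Service are hypothetical and are not taken from the scale test; only the `parentRefs`/`sectionName` fields come from the Gateway API.

```python
import json


def http_route(name, gateway, listener=None):
    """Build an HTTPRoute manifest as a plain dict.

    With listener=None the route attaches to the whole Gateway; otherwise
    sectionName narrows the attachment to the named Listener.
    """
    parent_ref = {"name": gateway}
    if listener is not None:
        parent_ref["sectionName"] = listener
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name},
        "spec": {
            "parentRefs": [parent_ref],
            # Hypothetical backend Service, for illustration only.
            "rules": [{"backendRefs": [{"name": "backend", "port": 8080}]}],
        },
    }


# All names below are made up for this sketch.
whole_gateway = http_route("route-1", "gw-1")
single_listener = http_route("route-1", "gw-1", listener="listener-1")
print(json.dumps(single_listener["spec"]["parentRefs"], indent=2))
```

The only difference between the two manifests is the extra `sectionName` key in the parent reference.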

I tried with 2 GWs as well:
2 GW 16 Listeners -> 18s to get status for all policies
2 GW 32 Listeners -> 76s to get status for all policies

However, this was still too much:
2 GW 63 Listeners -> 16 min to get status for all policies
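
A quick way to see how the time grows with the number of Listeners is to normalize the numbers above. The sketch below only reuses the timings reported in this comment; it assumes the Listener count is per Gateway (as in "2 Gateways and 63 Listeners on each"), and the dict and function names are made up for illustration.

```python
# Seconds until all policies had a status, copied from the runs above.
# Keys are (gateways, listeners per gateway); per-gateway counting is an assumption.
gateway_targeted = {
    (1, 16): 16,
    (1, 32): 120,
    (1, 48): 7 * 60,
    (1, 63): 30 * 60,
}
listener_targeted = {
    (1, 32): 18,
    (1, 48): 60,
    (1, 63): 120,
    (2, 16): 18,
    (2, 32): 76,
    (2, 63): 16 * 60,
}


def seconds_per_listener(results):
    """Divide each total time by the total number of Listeners in that run."""
    return {key: secs / (key[0] * key[1]) for key, secs in results.items()}


def speedup(before, after):
    """Ratio before/after for every configuration present in both result sets."""
    return {key: before[key] / after[key] for key in before if key in after}


for (gws, listeners), value in seconds_per_listener(gateway_targeted).items():
    print(f"{gws} GW x {listeners} Listeners, Gateway-targeted: {value:.2f} s per Listener")

for (gws, listeners), ratio in speedup(gateway_targeted, listener_targeted).items():
    print(f"{gws} GW x {listeners} Listeners: {ratio:.1f}x faster with a Listener target")
```

With linear scaling the per-Listener time would stay roughly constant. Instead, the Gateway-targeted runs go from 1.0 s per Listener at 16 Listeners to roughly 28.6 s per Listener at 63, so the growth is clearly superlinear.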

Initial Investigation

Be aware that certificate generation and DNS record creation might affect the results. It takes some time for certificates to get created (the scale test uses a self-signed cluster issuer) and it also takes time for the cloud provider to create that many DNS records.

It seems reasonable that the wasm config (only one per GW) is a contention point. That said, it should eventually get there.
There are repeated log entries of “failed to update the object has been modified; please apply your changes to the latest version” in the Kuadrant operator pod log.
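
For context, "the object has been modified" is the conflict the Kubernetes API returns when an update carries a stale resourceVersion. The following is a minimal in-memory sketch of that optimistic concurrency check and of a read-modify-write retry loop. It is not the Kuadrant operator's code: `FakeStore`, `Conflict`, `update_with_retry` and the `listener_*` keys are all hypothetical stand-ins.

```python
class Conflict(Exception):
    """Stands in for the API server's 'object has been modified' conflict."""


class FakeStore:
    """In-memory stand-in for one API object guarded by a version counter."""

    def __init__(self):
        self.version = 0
        self.data = {}

    def get(self):
        # Return a copy so callers cannot mutate the stored object directly.
        return self.version, dict(self.data)

    def update(self, version, data):
        # Reject writes that were computed from an outdated read.
        if version != self.version:
            raise Conflict("the object has been modified")
        self.data = dict(data)
        self.version += 1


def update_with_retry(store, mutate, attempts=5):
    """Re-read, re-apply the change and write again until the write succeeds."""
    for attempt in range(1, attempts + 1):
        version, data = store.get()
        mutate(data)
        try:
            store.update(version, data)
            return attempt
        except Conflict:
            continue
    raise Conflict(f"still conflicting after {attempts} attempts")


store = FakeStore()
stale_version, stale_data = store.get()  # writer A reads first
update_with_retry(store, lambda d: d.update(listener_1="policy-a"))  # writer B wins

try:
    store.update(stale_version, stale_data)  # writer A writes from its stale read
    stale_write_rejected = False
except Conflict as err:
    stale_write_rejected = True
    print(f"conflict: {err}")

# Writer A re-reads and retries, so neither change is lost.
update_with_retry(store, lambda d: d.update(listener_2="policy-b"))
print(store.version, store.data)
```

With many writers contending for a single object, most attempts are thrown away and redone, which would be consistent with a single per-GW config being a contention point. Whether that is what actually happens in the operator is something this issue still needs to confirm.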

Entries like "failed to create SOMETHING, SOMETHING already exists" also appeared in the Kuadrant operator pod log (SOMETHING typically being a Certificate or AuthConfig). I am not sure whether this indicates a problem.

Questions / Investigation required

Does it make sense that having many invalid paths is so expensive?
What can be done to improve on those 16 minutes? 2 Gateways with 63 Listeners each are not particularly high numbers.

Steps to reproduce

Basically follow the readme of the scale test:
https://github.com/Kuadrant/testsuite/tree/main/scale_test
I used OCP on AWS (6 worker nodes) with DNS set up on the same AWS account.
This was done against Kuadrant v1.0.1 on OCP v4.17.7.

I believe there is enough detail here; for even more details see (Red Hat only, sorry):
https://docs.google.com/document/d/1ATH2aZJ7-qlYTV3jF_rZduMC1MTPKoWCD-N4LmmcaMA/edit?tab=t.0

@trepel trepel added this to Kuadrant Dec 18, 2024

trepel commented Dec 18, 2024

PR with added sectionName:
Kuadrant/testsuite#610
