Overview
Recently a scale test has been implemented:
https://github.com/Kuadrant/testsuite/tree/main/scale_test
I ran a few scale test runs. It took quite some time for the Kuadrant operator to reconcile all the policies. Many AuthPolicies and RLPs did not get any status for quite some time, and once they did, the status was often:
'AuthPolicy waiting for the following components to sync: [AuthConfig (0cbc22a687a9ff2a57c54007e8ad9b6bc17de3744144196b9b8286fb1593f495)]'
and
'RateLimitPolicy waiting for the following components to sync: [Limitador]'
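For anyone reproducing this, those messages show up in the policies' status conditions, roughly like the sketch below (illustrative only; the condition reason and exact layout may differ between Kuadrant versions):

```yaml
status:
  conditions:
    - type: Enforced
      status: "False"
      reason: Unknown   # illustrative; the actual reason string may differ
      message: 'AuthPolicy waiting for the following components to sync: [AuthConfig (0cbc22a687a9ff2a57c54007e8ad9b6bc17de3744144196b9b8286fb1593f495)]'
```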
Everything got reconciled successfully and policies got enforced eventually, but it took quite some time:
1 GW 16 Listeners -> 16s to get status for all policies
1 GW 32 Listeners -> 120s to get status for all policies
1 GW 48 Listeners -> 7 min to get status for all policies
1 GW 63 Listeners -> 30 min to get status for all policies
In the operator log there were a lot of entries complaining about invalid paths, so I made each HTTPRoute target a specific Listener rather than the whole Gateway (see the example manifest after the numbers below). This made the results much nicer:
1 GW 32 Listeners -> 18s to get status for all policies
1 GW 48 Listeners -> 60s to get status for all policies
1 GW 63 Listeners -> 120s to get status for all policies
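For reference, the change amounts to pointing each HTTPRoute's parentRef at a single listener via sectionName instead of at the Gateway as a whole. The names, hostname, and listener name below are made up for illustration and are not taken from the testsuite:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: scale-route-17        # illustrative name
  namespace: scale-test       # illustrative namespace
spec:
  parentRefs:
    - name: scale-gateway     # the shared Gateway
      sectionName: listener-17  # attach to one specific listener instead of the whole Gateway
  hostnames:
    - api-17.example.com      # should match that listener's hostname
  rules:
    - backendRefs:
        - name: scale-backend
          port: 8080
```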
I tried with 2 GWs as well:
2 GW 16 Listeners -> 18s to get status for all policies
2 GW 32 Listeners -> 76s to get status for all policies
However, with 63 Listeners per Gateway it was still too slow:
2 GW 63 Listeners -> 16 min to get status for all policies
Initial Investigation
Be aware that certificate generation and DNS record creation might affect the results. It takes some time for certificates to be issued (the scale test uses a self-signed cluster issuer; a minimal sketch of such an issuer is shown below), and it also takes time for the cloud provider to create that many DNS records.
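This is roughly what a self-signed cluster issuer looks like in cert-manager terms (an illustrative sketch; the issuer the testsuite actually creates may be named and configured differently):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: self-signed-cluster-issuer   # illustrative name
spec:
  selfSigned: {}   # certificates are signed locally, no external CA round-trip
```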
It seems plausible that the wasm config (there is only one per GW) is a contention point; that said, it should eventually converge.
There are repeated log entries of “failed to update the object has been modified; please apply your changes to the latest version” in the Kuadrant operator pod log.
Entries like "failed to create SOMETHING, SOMETHING already exists" also appeared in the Kuadrant operator pod log (SOMETHING typically being a Certificate or AuthConfig) - not sure whether this indicates a problem or not.
Questions / Investigation required
Does it make sense that having many invalid paths is so expensive?
What can be done to improve on those 16 minutes? 2 Gateways with 63 Listeners each are not particularly high numbers.
Steps to reproduce
Basically, follow the README of the scale test:
https://github.com/Kuadrant/testsuite/tree/main/scale_test
I used OCP on AWS (6 worker nodes) with the DNS setup in the same AWS account.
This was done against Kuadrant v1.0.1 on OCP v4.17.7.
I believe there are enough details here; for even more detail, see (Red Hat only, sorry):
https://docs.google.com/document/d/1ATH2aZJ7-qlYTV3jF_rZduMC1MTPKoWCD-N4LmmcaMA/edit?tab=t.0