[Security Solution] `/upgrade/_perform` performance testing (elastic#197898)
## Summary

- Creates a new `withSyncSecuritySpan` wrapper to measure synchronous functions in APM, and applies it to the new CPU-intensive logic in the `/upgrade/_perform` endpoint (a hedged sketch of such a wrapper is included after the APM request profiling section below).
- Performs performance testing on the endpoint. See results below.

## Performance testing

### Possible OOMs in production Serverless

Created a Serverless instance and manually installed different Prebuilt Rules packages to force rule upgrades.

- With the currently published prebuilt packages, a user cannot update more than 950 rules with a single request.
- This number is expected to grow, but at a slower pace than the actual number of rules being published.
- Also, as users start customizing rules, rules with conflicts will be excluded from bulk requests, which will **make payloads even smaller.**
- Testing the biggest possible upgrade request, Serverless behaved reliably and no **timeouts** or **OOMs** occurred:

| From version | To version | Rule Updates | Request time |
|--------------|------------|--------------|--------------|
| 8.9.9 | 8.15.9 | 913 | 47.3s |
| 8.9.12 | 8.15.9 | 917 | 52.34s |
| 8.9.15 | 8.15.9 | 928 | 56.08s |
| 8.10.4 | 8.15.9 | 872 | 43.29s |
| 8.10.5 | 8.15.9 | 910 | 52.21s |
| 8.10.6 | 8.15.9 | 913 | 55.92s |
| 8.10.7 | 8.15.9 | 924 | 49.89s |
| 8.11.2 | 8.15.9 | 910 | 56.48s |
| 8.11.5 | 8.15.9 | 928 | 49.22s |
| 8.11.16 | 8.15.9 | 695 | 38.91s |
| 8.12.6 | 8.15.9 | 947 | 51.13s |
| 8.13.11 | 8.15.9 | 646 | 42.98s |

- Given the positive results for much bigger payloads in **Memory profiling with limited memory in Kibana production mode** below, we can assume there is no risk of OOMs in Serverless at the moment.

### Memory profiling with limited memory in Kibana production mode

- Launched Kibana in production mode with a **memory limit of 700 MB** to mimic the Serverless environment (where memory is a hard constraint) as closely as possible.
- Stress tested with a large number of requests and observed the following behaviour:

| Rule Updates | Request time (min) | OOM error? | Metrics |
|--------------|--------------------|------------|---------|
| 1500 | 1.1 | No | <details><summary>Unfold</summary>![image](https://github.com/user-attachments/assets/46303a1a-a929-4c00-8777-8d1f23face17)</details> |
| 2000 | 1.5 | No | <details><summary>Unfold</summary>![image](https://github.com/user-attachments/assets/bd33d259-50fd-42df-947d-3a2e7c5c78c3)</details> |
| 2500 | 1.8 | No | <details><summary>Unfold</summary>![image](https://github.com/user-attachments/assets/9145d2e7-e87c-4ba6-8633-7fe1087c29fb)</details> |
| 2750 | 1.9 | No | <details><summary>Unfold</summary>![image](https://github.com/user-attachments/assets/9009163e-f58d-4be3-8a1f-87844760a037)</details> |
| 3000 | - | YES | |

- Rule upgrade OOMs consistently when the payload is >= 3000 rules, but behaves reliably below that, which leaves a good enough buffer for growth of the Prebuilt Rules package.
- Also, the sawtooth shape of the heap-usage graphs shows that garbage collection works properly for payloads under 3000 rules.

### APM request profiling

- Connected Kibana in production mode to an APM server to measure spans of the `/upgrade/_perform` request.
- Additionally, measured the new CPU-intensive logic that calculates rule diffs and creates rule payloads for upgrades.
- An example span for a successful upgrade of 2500 rules: <img width="1722" alt="image" src="https://github.com/user-attachments/assets/07aa3079-5ce4-4b87-ab41-2a3e133316ef">
- The new spans for the CPU-intensive tasks `getUpgradeableRules` and `createModifiedPrebuiltRuleAssets`, which are displayed as `blocking`, have an acceptable span length and do not add considerable overhead to the total length of the request (wrapper sketched below).
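For reference, a minimal sketch of what a `withSyncSecuritySpan` wrapper could look like, assuming it calls the `elastic-apm-node` agent directly; the exact signature, span type, and file location in this PR may differ, and the usage arguments shown in the trailing comment are hypothetical:

```ts
import agent from 'elastic-apm-node';

/**
 * Wraps a synchronous function in an APM span so CPU-bound work
 * shows up in the transaction waterfall (sketch; not the exact PR code).
 */
export function withSyncSecuritySpan<T>(spanName: string, fn: () => T): T {
  // Start a custom span on the currently active transaction
  // (startSpan returns null when APM is disabled or no transaction is active).
  const span = agent.startSpan(spanName, 'blocking');
  try {
    return fn();
  } finally {
    // End the span once the synchronous work completes so its duration is recorded.
    span?.end();
  }
}

// Hypothetical usage for one of the CPU-intensive steps measured in this PR:
// const upgradeableRules = withSyncSecuritySpan('getUpgradeableRules', () =>
//   getUpgradeableRules(/* upgrade candidates and version info */)
// );
```

Wrapping the synchronous call directly, rather than forcing it through an async helper, keeps the measured work on the same tick, so the span length reflects the actual blocking time.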
### Timeout testing

- Tested Kibana with `--no-base-path` in order to check for potential timeouts in long-running requests (the ESS Cloud proxy is supposed to have a request timeout configured at 2.5 minutes).
- Still [confirming](https://elastic.slack.com/archives/C5UDAFZQU/p1730297621776729) with Kibana Operations the behaviour of the timeouts in ESS and Serverless environments.
- Tested with mock rules (indexed directly into ES) and **no timeouts occurred**:

| Rule Updates | Request time (min) |
|--------------|--------------------|
| 2000 | 2.1 |
| 2000 | 2.1 |
| 2250 | 2.3 |
| 2500 | 2.6 |
| 3000 | 3.1 |

### Conclusion

The results show that the `/upgrade/_perform` endpoint performs reliably under stress, given the currently expected request payloads. The only point left to triple-check is the behaviour of server request timeouts in Serverless: I'm waiting for the Kibana Ops team to get back to me, even though the testing here did not show any timeouts.

---------

Co-authored-by: Elastic Machine <[email protected]>
Co-authored-by: Dmitrii Shevchenko <[email protected]>