[Security Solution] `/upgrade/_perform` performance testing #197898

Conversation
Pinging @elastic/security-detection-rule-management (Team:Detection Rule Management)
Pinging @elastic/security-detections-response (Team:Detections and Resp)
Pinging @elastic/security-solution (Team: SecuritySolution)
@elasticmachine merge upstream
@jpdjere Great work, thank you for describing in detail the testing you've done 👍 Some questions to discuss offline:
Just a few minor comments for the code changes.
...detection_engine/prebuilt_rules/api/perform_rule_upgrade/create_upgradeable_rules_payload.ts
```ts
const { modifiedPrebuiltRuleAssets, processingErrors } =
  upgradeableRules.reduce<ProcessedRules>(
    (processedRules, upgradeableRule) => {
      const targetRuleType = upgradeableRule.target.type;
      // …
```
In a follow-up PR it would be great to try to refactor this function - the nesting here is already beyond manageable/readable.
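A minimal sketch of one possible direction for that follow-up, assuming `ProcessedRules` simply accumulates modified assets and per-rule errors; `RuleAsset`, `UpgradeableRule`, and `buildModifiedAsset` are hypothetical stand-ins for the real types and helpers, not the PR's actual code:

```ts
// Hypothetical refactor sketch: replace the deeply nested reduce with a flat loop
// plus a per-rule helper, so the happy path and error handling read linearly.
interface RuleAsset {
  rule_id: string;
  type: string;
}

interface UpgradeableRule {
  target: RuleAsset;
}

interface ProcessedRules {
  modifiedPrebuiltRuleAssets: RuleAsset[];
  processingErrors: Array<{ error: Error; ruleId: string }>;
}

// Stand-in for the logic currently inlined in the reduce callback.
const buildModifiedAsset = (upgradeableRule: UpgradeableRule): RuleAsset => ({
  ...upgradeableRule.target,
});

export const processUpgradeableRules = (
  upgradeableRules: UpgradeableRule[]
): ProcessedRules => {
  const result: ProcessedRules = { modifiedPrebuiltRuleAssets: [], processingErrors: [] };

  for (const upgradeableRule of upgradeableRules) {
    try {
      result.modifiedPrebuiltRuleAssets.push(buildModifiedAsset(upgradeableRule));
    } catch (error) {
      result.processingErrors.push({
        error: error as Error,
        ruleId: upgradeableRule.target.rule_id,
      });
    }
  }

  return result;
};
```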
...server/lib/detection_engine/prebuilt_rules/api/perform_rule_upgrade/get_upgradeable_rules.ts
x-pack/plugins/security_solution/server/utils/with_security_span.ts
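For context, the PR summary describes a new `withSyncSecuritySpan` wrapper in this file for measuring synchronous functions in APM. A minimal sketch of what such a wrapper might look like, assuming the standard `elastic-apm-node` agent API; the span type string and the exact signature are assumptions, not necessarily the PR's implementation:

```ts
import apm from 'elastic-apm-node';

/**
 * Sketch: measures a synchronous, CPU-bound function as an APM span so its
 * blocking time shows up in the request trace. The 'SecuritySolution' span type
 * mirrors the existing async security span helper and is an assumption here.
 */
export const withSyncSecuritySpan = <T>(name: string, fn: () => T): T => {
  // startSpan returns null when there is no active transaction to attach to.
  const span = apm.startSpan(name, 'SecuritySolution');
  try {
    return fn();
  } finally {
    span?.end();
  }
};

// Example usage: wrap a CPU-intensive computation so its duration is recorded.
const checksum = withSyncSecuritySpan('computeChecksum', () =>
  Array.from({ length: 1_000_000 }, (_, i) => i).reduce((acc, n) => (acc + n) % 65_536, 0)
);
console.log(checksum);
```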
## Update on performance testing

### Production Serverless stress testing

Created a new production Serverless instance.

Results:

So results were even better than expected, given the previous testing running Kibana in production mode with limited memory (to mimic Serverless).

### APM profiling

There seems to be some type of delay or issue with the link to production APM, with the filter for the production project applied. In any case, since the `/upgrade/_perform` performance testing PR #197898 is not merged and released, the spans for the new blocking CPU-intensive logic will not be reported on the production APM server. Therefore, I created a PR Serverless deployment to get some insight into how these spans look.

Project Id:

The CPU intensive task of
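For reference, a rough sketch of how a single bulk-upgrade request can be timed from a script during this kind of stress testing. The route path, headers, request body shape, and `summary` response field are my assumptions about the internal prebuilt rules API, and the URL and credentials are placeholders:

```ts
// Hypothetical timing harness for one bulk-upgrade request against Kibana (Node 18+).
const KIBANA_URL = 'http://localhost:5601';
const AUTH = Buffer.from('elastic:changeme').toString('base64');

async function timeUpgradeAll(): Promise<void> {
  const started = Date.now();

  const response = await fetch(
    `${KIBANA_URL}/internal/detection_engine/prebuilt_rules/upgrade/_perform`,
    {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'kbn-xsrf': 'true',
        Authorization: `Basic ${AUTH}`,
      },
      // Assumed body shape: upgrade every upgradeable rule to its target version.
      body: JSON.stringify({ mode: 'ALL_RULES', pick_version: 'TARGET' }),
    }
  );

  const body = await response.json();
  console.log(`status=${response.status} took=${(Date.now() - started) / 1000}s`, body.summary);
}

timeUpgradeAll().catch(console.error);
```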
/redeploy
💚 Build Succeeded
cc @jpdjere
Thanks for the detailed performance testing, @jpdjere 🚀
The test results look good to me. My main concern was the potential for request timeouts during large rule upgrades, but the tests confirm this isn’t an issue. While the total time to upgrade 1,000 rules is a bit on the high side (~60 seconds), I don't think immediate action is necessary, given that rule upgrades should happen relatively infrequently—probably only a few times per month.
The CPU blocking time of ~700 ms per 1,000 rules also doesn’t seem likely to impact Kibana’s overall performance, especially since it’s an infrequent operation.
That said, as we discussed, it would be good to keep a couple of performance improvement items on the list to tackle after the initial release. Added a ticket to Milestone 4: #199101
The results look great! Now I feel quite confident about the performance of the upgrade workflow; it doesn't look like we're going to get timeouts or OOMs any time soon.
Also, thank you for posting the detailed description, sharing the links to dashboards and addressing the comments. LGTM 👍 🚀
Starting backport for target branches: 8.x
[Security Solution] `/upgrade/_perform` performance testing (#197898)

## Summary

- Creates a new `withSyncSecuritySpan` wrapper to measure sync functions in APM. Adds this wrapper to new CPU-intensive logic in the `/upgrade/_perform` endpoint.
- Performs performance testing on the endpoint. See results below.

## Performance testing

### Possible OOMs in production Serverless

Created a Serverless instance and manually installed different Prebuilt Rules packages to force rule upgrades.

- With the current published prebuilt packages, a user cannot update more than 950 rules with a single request.
- This number is expected to grow, but at a slower pace than the actual number of rules being published.
- Also, as users start customizing rules, rules with conflicts will be excluded from bulk requests, which will **make payloads even smaller.**
- Testing the biggest possible upgrade request, Serverless behaved reliably and no **timeouts** or **OOMs** occurred:

| From version | To version | Rule Updates | Request time |
|---------|--------|---------|--------|
| 8.9.9 | 8.15.9 | 913 | 47.3s |
| 8.9.12 | 8.15.9 | 917 | 52.34s |
| 8.9.15 | 8.15.9 | 928 | 56.08s |
| 8.10.4 | 8.15.9 | 872 | 43.29s |
| 8.10.5 | 8.15.9 | 910 | 52.21s |
| 8.10.6 | 8.15.9 | 913 | 55.92s |
| 8.10.7 | 8.15.9 | 924 | 49.89s |
| 8.11.2 | 8.15.9 | 910 | 56.48s |
| 8.11.5 | 8.15.9 | 928 | 49.22s |
| 8.11.16 | 8.15.9 | 695 | 38.91s |
| 8.12.6 | 8.15.9 | 947 | 51.13s |
| 8.13.11 | 8.15.9 | 646 | 42.98s |

- Given the positive results for much bigger payloads seen in **Memory profiling with limited memory in Kibana production mode** below, we can assume that there's no risk of OOMs in Serverless at the moment.

### Memory profiling with limited memory in Kibana production mode

- Launched Kibana in production mode and set a **memory limit of 700 MB** to mimic as closely as possible the Serverless environment (where memory is a hard constraint).
- Stress tested with large request payloads and saw the following behaviour:

| Rule Updates | Request time (min) | OOM error? | Metrics |
|---------|--------|--------|--------|
| 1500 | 1.1 | No | <details><summary>Unfold</summary>![image](https://github.com/user-attachments/assets/46303a1a-a929-4c00-8777-8d1f23face17)</details> |
| 2000 | 1.5 | No | <details><summary>Unfold</summary>![image](https://github.com/user-attachments/assets/bd33d259-50fd-42df-947d-3a2e7c5c78c3)</details> |
| 2500 | 1.8 | No | <details><summary>Unfold</summary>![image](https://github.com/user-attachments/assets/9145d2e7-e87c-4ba6-8633-7fe1087c29fb)</details> |
| 2750 | 1.9 | No | <details><summary>Unfold</summary>![image](https://github.com/user-attachments/assets/9009163e-f58d-4be3-8a1f-87844760a037)</details> |
| 3000 | - | Yes | |

- Rule upgrades OOM consistently when the payload is >= 3000 rules, but behave reliably below that. That is a good enough buffer for growth of the Prebuilt Rules package.
- Also, the saw-toothed shape of the heap-usage graphs shows that garbage collection works properly for payloads under 3000 rules.

### APM request profiling

- Connected Kibana in production mode to an APM server to measure spans of the `/upgrade/_perform` request.
- Additionally, measured the new CPU-intensive logic that calculates rule diffs and creates rule payloads for upgrades.
- An example span for a successful upgrade of 2500 rules: <img width="1722" alt="image" src="https://github.com/user-attachments/assets/07aa3079-5ce4-4b87-ab41-2a3e133316ef">
- The new spans for the CPU-intensive tasks `getUpgradeableRules` and `createModifiedPrebuiltRuleAssets`, which are displayed as `blocking`, have an acceptable span length and do not add considerable overhead to the total length of the request.

### Timeout testing

- Tested Kibana with `--no-base-path` in order to check for potential timeouts in long-running requests (the ESS Cloud proxy is supposed to have a request timeout of 2.5 minutes).
- Still [confirming](https://elastic.slack.com/archives/C5UDAFZQU/p1730297621776729) with Kibana Operations the behaviour of the timeouts in ESS and Serverless environments.
- Tested with mock rules (indexed directly into ES) and **no timeouts occurred**:

| Rule Updates | Request time (min) |
|---------|--------|
| 2000 | 2.1 |
| 2000 | 2.1 |
| 2250 | 2.3 |
| 2500 | 2.6 |
| 3000 | 3.1 |

### Conclusion

The results show that the `/upgrade/_perform` endpoint performs reliably under stress, given the current and expected request payloads.

The only question to triple-check is the behaviour of server request timeouts in Serverless: I'm waiting for the Kibana ops team to get back to me, even though testing here did not show cases of timeouts.

---------

Co-authored-by: Elastic Machine <[email protected]>
Co-authored-by: Dmitrii Shevchenko <[email protected]>

(cherry picked from commit bdb6ff1)
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions? Please refer to the Backport tool documentation.
[Security Solution] `/upgrade/_perform` performance testing (#197898) (#199128)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Security Solution] `/upgrade/_perform` performance testing (#197898)](#197898)

### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Juan Pablo Djeredjian <[email protected]>