
[Security Solution] /upgrade/_perform performance testing #197898

Merged
banderror merged 10 commits into elastic:main on Nov 6, 2024

Conversation

jpdjere
Contributor

@jpdjere jpdjere commented Oct 25, 2024

Summary

  • Creates a new `withSyncSecuritySpan` wrapper to measure synchronous functions in APM, and adds this wrapper to the new CPU-intensive logic in the `/upgrade/_perform` endpoint (a rough sketch of such a wrapper is shown after this list).
  • Performs performance testing on the endpoint; see results below.
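Below is a minimal sketch of what such a synchronous span wrapper might look like, assuming the `elastic-apm-node` agent API; the wrapper name, span type, and usage are illustrative rather than the exact Kibana implementation:

```ts
import apm from 'elastic-apm-node';

// Hypothetical sketch: run a synchronous, CPU-intensive function inside an APM span.
export const withSyncSecuritySpan = <T>(name: string, fn: () => T): T => {
  // startSpan returns null when there is no active transaction to attach to
  const span = apm.startSpan(name, 'blocking');
  try {
    return fn();
  } finally {
    span?.end();
  }
};

// Example usage (the wrapped function is a placeholder):
// const assets = withSyncSecuritySpan('createModifiedPrebuiltRuleAssets', () =>
//   createModifiedPrebuiltRuleAssets(upgradeableRules)
// );
```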

Performance testing

Possible OOMs in production Serverless

Created a Serverless instance and manually installed different Prebuilt Rules packages to force rule upgrades.

  • With the current published prebuilt packages, a user cannot update more than 950 rules with a single request.
  • This number is expected to grow, but at a slower pace than the actual number of rules being published.
  • Also, as users start customizing rules, rules with conflicts will be excluded from bulk requests, which will make payloads even smaller.
  • Testing the biggest possible upgrade request, Serverless behaved reliably and no timeouts or OOMs occurred:
| From version | To version | Rule Updates | Request time |
|---------|--------|---------|--------|
| 8.9.9   | 8.15.9 | 913 | 47.3s  |
| 8.9.12  | 8.15.9 | 917 | 52.34s |
| 8.9.15  | 8.15.9 | 928 | 56.08s |
| 8.10.4  | 8.15.9 | 872 | 43.29s |
| 8.10.5  | 8.15.9 | 910 | 52.21s |
| 8.10.6  | 8.15.9 | 913 | 55.92s |
| 8.10.7  | 8.15.9 | 924 | 49.89s |
| 8.11.2  | 8.15.9 | 910 | 56.48s |
| 8.11.5  | 8.15.9 | 928 | 49.22s |
| 8.11.16 | 8.15.9 | 695 | 38.91s |
| 8.12.6  | 8.15.9 | 947 | 51.13s |
| 8.13.11 | 8.15.9 | 646 | 42.98s |
  • Given the positive results for much bigger payloads seen in the Memory profiling with limited memory in Kibana production mode below, we can assume that there's no risk of OOMs in Serverless at the moment.

Memory profiling with limited memory in Kibana production mode

  • Launched Kibana in production mode with a memory limit of 700 MB, to mimic the Serverless environment (where memory is a hard constraint) as closely as possible.
  • Stress tested with increasingly large requests and observed the following behaviour:
| Rule Updates | Request time (min) | OOM error? | Metrics |
|---------|--------|--------|--------|
| 1500 | 1.1 | No  | (heap metrics screenshot) |
| 2000 | 1.5 | No  | (heap metrics screenshot) |
| 2500 | 1.8 | No  | (heap metrics screenshot) |
| 2750 | 1.9 | No  | (heap metrics screenshot) |
| 3000 | -   | Yes | |
  • Rule upgrades OOM consistently when the payload is >= 3000 rules, but behave reliably below that, which leaves a good enough buffer for growth of the Prebuilt Rules package.
  • Also, the saw-tooth shape of the heap-used graphs shows that garbage collection works properly for payloads under 3000 rules (a simple way to capture such a heap trace is sketched below).
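For reference, a trivial heap sampler like the following can produce the kind of heap-used trace shown in the screenshots; this is an illustrative sketch, not the tooling that was actually used for these measurements:

```ts
// Logs one CSV line per second: timestamp, heapUsed (MB), heapTotal (MB).
// Run alongside the stress test and plot the output to see the GC saw-tooth.
const mb = (bytes: number): string => (bytes / 1024 / 1024).toFixed(1);

const sampler = setInterval(() => {
  const { heapUsed, heapTotal } = process.memoryUsage();
  console.log(`${Date.now()},${mb(heapUsed)},${mb(heapTotal)}`);
}, 1000);

// Stop sampling once the test run completes:
// clearInterval(sampler);
```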

APM request profiling

  • Connected Kibana in production mode to an APM server to measure spans of the `/upgrade/_perform` request.
  • Additionally, measured the new CPU-intensive logic which calculates rule diffs and creates rule payloads for upgrades.
  • An example span for a successful upgrade of 2500 rules: (screenshot of the APM transaction)
  • The new spans for the CPU-intensive tasks `getUpgradeableRules` and `createModifiedPrebuiltRuleAssets`, which are displayed as `blocking`, have an acceptable span length and do not add considerable overhead to the total duration of the request.

Timeout testing

  • Tested Kibana with `--no-base-path` in order to check for potential timeouts in long-running requests (the ESS Cloud proxy is supposed to have a request timeout of 2.5 minutes).
  • Still confirming with Kibana Operations the behaviour of the timeouts in ESS and Serverless environments.
  • Tested with mock rules (indexed directly into ES) and no timeouts occurred (a sketch of the request used for timing is shown after the table):
| Rule Updates | Request time (min) |
|---------|--------|
| 2000 | 2.1 |
| 2000 | 2.1 |
| 2250 | 2.3 |
| 2500 | 2.6 |
| 3000 | 3.1 |
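For reference, a rough sketch of how such a long-running upgrade request can be fired and timed from a script. The endpoint path, headers, and body shape are assumptions based on the prebuilt rules upgrade API and may differ by Kibana version and deployment:

```ts
// Hypothetical timing driver for the upgrade request; the host, credentials,
// and API path are placeholders to adjust for the instance under test.
const timeUpgradeRequest = async (kibanaUrl: string, auth: string) => {
  const started = Date.now();
  const res = await fetch(
    `${kibanaUrl}/internal/detection_engine/prebuilt_rules/upgrade/_perform`,
    {
      method: 'POST',
      headers: {
        'kbn-xsrf': 'true',
        'elastic-api-version': '1',
        'content-type': 'application/json',
        authorization: auth,
      },
      body: JSON.stringify({ mode: 'ALL_RULES' }),
    }
  );
  const elapsedMin = ((Date.now() - started) / 60_000).toFixed(2);
  console.log(`status=${res.status} elapsed=${elapsedMin} min`);
  return res.json();
};
```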

Conclusion

The results show that the `/upgrade/_perform` endpoint performs reliably under stress, given the current and expected request payloads.

The only question left to triple-check is the behaviour of server request timeouts in Serverless: I'm waiting for the Kibana Ops team to get back to me, even though the testing here did not show any timeouts.

@jpdjere jpdjere changed the title [Security Solution] /upgrade/_perform performance [Security Solution] /upgrade/_perform performance testing Oct 30, 2024
@jpdjere jpdjere marked this pull request as ready for review October 30, 2024 23:41
@jpdjere jpdjere requested review from a team as code owners October 30, 2024 23:41
@jpdjere jpdjere requested a review from xcrzx October 30, 2024 23:41
@jpdjere jpdjere self-assigned this Oct 30, 2024
@jpdjere jpdjere added release_note:skip Skip the PR/issue when compiling release notes Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Rule Management Security Detection Rule Management Team backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) labels Oct 30, 2024
@elasticmachine
Contributor

Pinging @elastic/security-detection-rule-management (Team:Detection Rule Management)

@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@jpdjere jpdjere added the Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules area label Oct 30, 2024
@banderror
Contributor

@elasticmachine merge upstream

@banderror
Contributor

@jpdjere Great work, thank you for describing in detail the testing you've done 👍

Some questions for discussing offline:

  • Can you elaborate on how exactly you tested "from version -> to version" upgrades in Serverless?
    • Did you customize any prebuilt rules? If yes, how many?
  • "Memory profiling with limited memory in Kibana production mode" - why did you decide to "mimic" Serverless and do this locally?
    • Can we do the same in real Serverless prod?
    • It would be nice to use APM for measuring CPU and memory consumption, similar to how Dmitrii did it a few days ago for package installation.
  • "APM request profiling"
    • How can we all access these profiles? Can you please share a link to a dashboard?
    • I'm not sure I agree with your assessment that `createModifiedPrebuiltRuleAssets` has an acceptable duration. It's 1.3 seconds of blocking time, during which no other code can be run by the server. I think we should consider splitting the calculation into non-blocking async chunks (macrotasks); a rough sketch of this idea is shown after this list.
    • Have you tried to experiment with parallelism to see how that affects the total duration of an API call?
  • "Timeout testing"
    • What about base path?
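As an illustration of the non-blocking chunking idea mentioned above, a sketch along these lines could work; the helper name and chunk size are illustrative, not a proposal for the actual implementation:

```ts
// Process items in slices and yield back to the event loop between slices,
// so a large synchronous calculation doesn't block other requests for its full duration.
const processInChunks = async <TItem, TResult>(
  items: TItem[],
  processItem: (item: TItem) => TResult,
  chunkSize = 100
): Promise<TResult[]> => {
  const results: TResult[] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    for (const item of items.slice(i, i + chunkSize)) {
      results.push(processItem(item));
    }
    // Schedule the next slice as a macrotask, letting pending I/O run first.
    await new Promise<void>((resolve) => setImmediate(resolve));
  }
  return results;
};
```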

Contributor

@banderror banderror left a comment


Just a few minor comments for the code changes.

Comment on lines +43 to +46
```ts
const { modifiedPrebuiltRuleAssets, processingErrors } =
  upgradeableRules.reduce<ProcessedRules>(
    (processedRules, upgradeableRule) => {
      const targetRuleType = upgradeableRule.target.type;
```
Contributor

In a follow-up PR it would be great to try to refactor this function - the nesting here is already beyond manageable/readable.

@jpdjere jpdjere added the ci:project-deploy-security Create a Security Serverless Project label Nov 5, 2024
@jpdjere
Contributor Author

jpdjere commented Nov 5, 2024

Update on performance testing

Production Serverless stress testing

Created a new production Serverless instance (https://rules-1500-ff576f.kb.us-east-1.aws.elastic.cloud/) and, using Fleet's API, uploaded locally generated Prebuilt Rules packages of increasing size. The process for each test was:

  • create N rules with a set of fields and version X
  • package them into a zip file
  • upload it into the Serverless instance via the /api/fleet/epm/packages endpoint (a sketch of this step is shown after this list)
  • install all rules
  • recreate the N rules with a different set of fields (to generate diffs) and version X + 1
  • recreate the zip and upload it into the Serverless instance to trigger the upgrade workflow for all rules
  • update all rules by clicking the Update All Rules button
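A rough sketch of the package upload step, assuming Fleet's zip-upload endpoint accepts the archive as the request body; the host, credentials, and file path are placeholders:

```ts
import { readFile } from 'node:fs/promises';

// Uploads a locally built Prebuilt Rules package zip to a Kibana/Serverless instance.
const uploadPackage = async (kibanaUrl: string, zipPath: string, apiKey: string) => {
  const zip = await readFile(zipPath);
  const res = await fetch(`${kibanaUrl}/api/fleet/epm/packages`, {
    method: 'POST',
    headers: {
      'kbn-xsrf': 'true',
      'content-type': 'application/zip',
      authorization: `ApiKey ${apiKey}`,
    },
    body: zip,
  });
  if (!res.ok) {
    throw new Error(`Package upload failed: ${res.status} ${await res.text()}`);
  }
  return res.json();
};
```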

Results:

| Rules upgraded | Time taken by request (min) |
|---------|--------|
| 1500 | 1.2 |
| 1750 | 1.4 |
| 2000 | 1.6 |
| 2500 | 1.9 |
| 2750 | 2.2 |
| 3000 | 2.4 |
| 3500 | 2.9 |
| 4000 | 3.1 |
| 6000 | OOM |

So the results were even better than expected, given the previous testing running Kibana in production mode with limited memory (to mimic Serverless).

  • The request timeout of 60s, or even 2.5 minutes, does not apply in Serverless.
  • OOM issues only appeared at package sizes vastly greater than what Kibana will face in the medium or even long term, given the number of rule updates that are currently possible (~1000). Testing proved that a package update of 4000 rules is possible, which gives us more than enough buffer for the future.

APM profiling

There seems to be some kind of delay or issue with the /upgrade/_perform requests being reported in Production. They might become visible tomorrow (I've seen pretty big delays before), but there's also a chance that they are just not picked up correctly by the APM server.

LINK to Production APM, with filter for production project applied:

Anyway, since this PR ([Security Solution] /upgrade/_perform performance testing #197898) is not yet merged and released, the spans for the new blocking CPU-intensive logic will not be reported on the production APM server. Therefore, I created a PR Serverless deployment to get some insight into what these spans look like.

Project Id: 20d80f52fb1d5e448e918b9638d16c59d8
Span for a /upgrade/_perform transaction updating 3000 rules: LINK

(screenshot: APM transaction span for the 3000-rule upgrade)

The CPU-intensive task createPrebuiltRuleAssetsPayload took 2344 ms, which for a 3000-rule upgrade request is roughly 0.8 ms per rule. Testing shows this span increases linearly with the size of the payload, but it does not become a bottleneck for the request.

@jpdjere
Contributor Author

jpdjere commented Nov 5, 2024

/redeploy

@jpdjere jpdjere added the ci:project-redeploy Always create a new Cloud project label Nov 5, 2024
@jpdjere jpdjere requested a review from banderror November 5, 2024 23:22
@elasticmachine
Contributor

elasticmachine commented Nov 6, 2024

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

cc @jpdjere

Contributor

@xcrzx xcrzx left a comment


Thanks for the detailed performance testing, @jpdjere 🚀

The test results look good to me. My main concern was the potential for request timeouts during large rule upgrades, but the tests confirm this isn’t an issue. While the total time to upgrade 1,000 rules is a bit on the high side (~60 seconds), I don't think immediate action is necessary, given that rule upgrades should happen relatively infrequently—probably only a few times per month.

The CPU blocking time of ~700 ms per 1,000 rules also doesn’t seem likely to impact Kibana’s overall performance, especially since it’s an infrequent operation.

That said, as we discussed, it would be good to keep a couple of performance improvement items on the list to tackle after the initial release. Added a ticket to Milestone 4: #199101

@banderror banderror added v9.0.0 backport:version Backport to applied version labels v8.17.0 and removed backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) labels Nov 6, 2024
Contributor

@banderror banderror left a comment


The results look great! Now I feel quite confident about the performance of the upgrade workflow; it doesn't look like we're going to get timeouts or OOMs any time soon.

Also, thank you for posting the detailed description, sharing the links to dashboards and addressing the comments. LGTM 👍 🚀

@banderror banderror merged commit bdb6ff1 into elastic:main Nov 6, 2024
52 checks passed
@kibanamachine
Contributor

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/11703902858

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 6, 2024
@kibanamachine
Contributor

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Nov 6, 2024
…esting (#197898) (#199128)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Security Solution] `/upgrade/_perform` performance testing (#197898)](#197898)

mgadewoll pushed a commit to mgadewoll/kibana that referenced this pull request Nov 7, 2024