
[Security Solution] /upgrade/_perform performance testing #197898

Merged
banderror merged 10 commits into elastic:main on Nov 6, 2024

Conversation

jpdjere
Contributor

@jpdjere jpdjere commented Oct 25, 2024

Summary

  • Creates a new `withSyncSecuritySpan` wrapper to measure synchronous functions in APM, and adds this wrapper to the new CPU-intensive logic in the `/upgrade/_perform` endpoint (a rough sketch of such a wrapper is shown after this list).
  • Performs performance testing on the endpoint; see results below.
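Below is a minimal sketch of what such a synchronous span wrapper might look like, assuming the `elastic-apm-node` agent API; the wrapper name, span type, and usage are illustrative rather than the exact Kibana implementation:

```ts
import apm from 'elastic-apm-node';

// Hypothetical sketch: run a synchronous, CPU-intensive function inside an APM span.
export const withSyncSecuritySpan = <T>(name: string, fn: () => T): T => {
  // startSpan returns null when there is no active transaction to attach to
  const span = apm.startSpan(name, 'blocking');
  try {
    return fn();
  } finally {
    span?.end();
  }
};

// Example usage (the wrapped function is a placeholder):
// const assets = withSyncSecuritySpan('createModifiedPrebuiltRuleAssets', () =>
//   createModifiedPrebuiltRuleAssets(upgradeableRules)
// );
```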

Performance testing

Possible OOMs in production Serverless

Created a Serverless instance and manually installed different Prebuilt Rules packages to force rule upgrades.

  • With the current published prebuilt packages, a user cannot update more than 950 rules with a single request.
  • This number is expected to grow, but at a slower pace than the actual number of rules being published.
  • Also, as users start customizing rules, rules with conflicts will be excluded from bulk requests, which will make payloads even smaller.
  • Testing the biggest possible upgrade request, Serverless behaved reliably and no timeouts or OOMs occurred:
| From version | To version | Rule Updates | Request time |
|---------|--------|---------|--------|
| 8.9.9   | 8.15.9 | 913 | 47.3s  |
| 8.9.12  | 8.15.9 | 917 | 52.34s |
| 8.9.15  | 8.15.9 | 928 | 56.08s |
| 8.10.4  | 8.15.9 | 872 | 43.29s |
| 8.10.5  | 8.15.9 | 910 | 52.21s |
| 8.10.6  | 8.15.9 | 913 | 55.92s |
| 8.10.7  | 8.15.9 | 924 | 49.89s |
| 8.11.2  | 8.15.9 | 910 | 56.48s |
| 8.11.5  | 8.15.9 | 928 | 49.22s |
| 8.11.16 | 8.15.9 | 695 | 38.91s |
| 8.12.6  | 8.15.9 | 947 | 51.13s |
| 8.13.11 | 8.15.9 | 646 | 42.98s |
  • Given the positive results for much bigger payloads seen in the Memory profiling with limited memory in Kibana production mode below, we can assume that there's no risk of OOMs in Serverless at the moment.

Memory profiling with limited memory in Kibana production mode

  • Launched Kibana in production mode with a memory limit of 700 MB, to mimic the Serverless environment (where memory is a hard constraint) as closely as possible.
  • Stress tested with increasingly large requests and observed the following behaviour:
| Rule Updates | Request time (min) | OOM error? | Metrics |
|---------|--------|--------|--------|
| 1500 | 1.1 | No  | (heap metrics screenshot) |
| 2000 | 1.5 | No  | (heap metrics screenshot) |
| 2500 | 1.8 | No  | (heap metrics screenshot) |
| 2750 | 1.9 | No  | (heap metrics screenshot) |
| 3000 | -   | Yes | |
  • Rule upgrades OOM consistently when the payload is >= 3000 rules, but behave reliably below that, which leaves a good enough buffer for growth of the Prebuilt Rules package.
  • Also, the saw-tooth shape of the heap-used graphs shows that garbage collection works properly for payloads under 3000 rules (a simple way to capture such a heap trace is sketched below).
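For reference, a trivial heap sampler like the following can produce the kind of heap-used trace shown in the screenshots; this is an illustrative sketch, not the tooling that was actually used for these measurements:

```ts
// Logs one CSV line per second: timestamp, heapUsed (MB), heapTotal (MB).
// Run alongside the stress test and plot the output to see the GC saw-tooth.
const mb = (bytes: number): string => (bytes / 1024 / 1024).toFixed(1);

const sampler = setInterval(() => {
  const { heapUsed, heapTotal } = process.memoryUsage();
  console.log(`${Date.now()},${mb(heapUsed)},${mb(heapTotal)}`);
}, 1000);

// Stop sampling once the test run completes:
// clearInterval(sampler);
```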

APM request profiling

  • Connected Kibana in production mode to an APM server to measure spans of the `/upgrade/_perform` request.
  • Additionally, measured the new CPU-intensive logic which calculates rule diffs and creates rule payloads for upgrades.
  • An example span for a successful upgrade of 2500 rules: (screenshot of the APM transaction)
  • The new spans for the CPU-intensive tasks `getUpgradeableRules` and `createModifiedPrebuiltRuleAssets`, which are displayed as `blocking`, have an acceptable span length and do not add considerable overhead to the total duration of the request.

Timeout testing

  • Tested Kibana with `--no-base-path` in order to check for potential timeouts in long-running requests (the ESS Cloud proxy is supposed to have a request timeout of 2.5 minutes).
  • Still confirming with Kibana Operations the behaviour of the timeouts in ESS and Serverless environments.
  • Tested with mock rules (indexed directly into ES) and no timeouts occurred (a sketch of the request used for timing is shown after the table):
| Rule Updates | Request time (min) |
|---------|--------|
| 2000 | 2.1 |
| 2000 | 2.1 |
| 2250 | 2.3 |
| 2500 | 2.6 |
| 3000 | 3.1 |
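For reference, a rough sketch of how such a long-running upgrade request can be fired and timed from a script. The endpoint path, headers, and body shape are assumptions based on the prebuilt rules upgrade API and may differ by Kibana version and deployment:

```ts
// Hypothetical timing driver for the upgrade request; the host, credentials,
// and API path are placeholders to adjust for the instance under test.
const timeUpgradeRequest = async (kibanaUrl: string, auth: string) => {
  const started = Date.now();
  const res = await fetch(
    `${kibanaUrl}/internal/detection_engine/prebuilt_rules/upgrade/_perform`,
    {
      method: 'POST',
      headers: {
        'kbn-xsrf': 'true',
        'elastic-api-version': '1',
        'content-type': 'application/json',
        authorization: auth,
      },
      body: JSON.stringify({ mode: 'ALL_RULES' }),
    }
  );
  const elapsedMin = ((Date.now() - started) / 60_000).toFixed(2);
  console.log(`status=${res.status} elapsed=${elapsedMin} min`);
  return res.json();
};
```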

Conclusion

The results show that the `/upgrade/_perform` endpoint performs reliably under stress, given the current and expected request payloads.

The only question left to triple-check is the behaviour of server request timeouts in Serverless: I'm waiting for the Kibana Ops team to get back to me, even though the testing here did not show any timeouts.

@jpdjere jpdjere changed the title [Security Solution] /upgrade/_perform performance [Security Solution] /upgrade/_perform performance testing Oct 30, 2024
@jpdjere jpdjere marked this pull request as ready for review October 30, 2024 23:41
@jpdjere jpdjere requested review from a team as code owners October 30, 2024 23:41
@jpdjere jpdjere requested a review from xcrzx October 30, 2024 23:41
@jpdjere jpdjere self-assigned this Oct 30, 2024
@jpdjere jpdjere added release_note:skip Skip the PR/issue when compiling release notes Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Rule Management Security Detection Rule Management Team backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) labels Oct 30, 2024
@elasticmachine
Contributor

Pinging @elastic/security-detection-rule-management (Team:Detection Rule Management)

@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@jpdjere jpdjere added the Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules area label Oct 30, 2024
@banderror
Contributor

@elasticmachine merge upstream

@banderror
Contributor

@jpdjere Great work, thank you for describing in detail the testing you've done 👍

Some questions for discussing offline:

  • Can you elaborate on how exactly you tested "from version -> to version" upgrades in Serverless?
    • Did you customize any prebuilt rules? If yes, how many?
  • "Memory profiling with limited memory in Kibana production mode" - why did you decide to "mimic" Serverless and do this locally?
    • Can we do the same in real Serverless prod?
    • It would be nice to use APM for measuring CPU and memory consumption, similar to how Dmitrii did it a few days ago for package installation.
  • "APM request profiling"
    • How can we all access these profiles? Can you please share a link to a dashboard?
    • I'm not sure I agree with your assessment that `createModifiedPrebuiltRuleAssets` has an acceptable duration. It's 1.3 seconds of blocking time, during which no other code can be run by the server. I think we should consider splitting the calculation into non-blocking async chunks (macrotasks); a rough sketch of this idea is shown after this list.
    • Have you tried to experiment with parallelism to see how that affects the total duration of an API call?
  • "Timeout testing"
    • What about base path?
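As an illustration of the non-blocking chunking idea mentioned above, a sketch along these lines could work; the helper name and chunk size are illustrative, not a proposal for the actual implementation:

```ts
// Process items in slices and yield back to the event loop between slices,
// so a large synchronous calculation doesn't block other requests for its full duration.
const processInChunks = async <TItem, TResult>(
  items: TItem[],
  processItem: (item: TItem) => TResult,
  chunkSize = 100
): Promise<TResult[]> => {
  const results: TResult[] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    for (const item of items.slice(i, i + chunkSize)) {
      results.push(processItem(item));
    }
    // Schedule the next slice as a macrotask, letting pending I/O run first.
    await new Promise<void>((resolve) => setImmediate(resolve));
  }
  return results;
};
```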

Contributor

@banderror banderror left a comment


Just a few minor comments for the code changes.

Comment on lines +43 to +46
```ts
const { modifiedPrebuiltRuleAssets, processingErrors } =
  upgradeableRules.reduce<ProcessedRules>(
    (processedRules, upgradeableRule) => {
      const targetRuleType = upgradeableRule.target.type;
```
Contributor

In a follow-up PR it would be great to try to refactor this function - the nesting here is already beyond manageable/readable.

@jpdjere jpdjere added the ci:project-deploy-security Create a Security Serverless Project label Nov 5, 2024
@jpdjere
Contributor Author

jpdjere commented Nov 5, 2024

Update on performance testing

Production Serverless stress testing

Created a new production Serverless instance (https://rules-1500-ff576f.kb.us-east-1.aws.elastic.cloud/) and, using Fleet's API, uploaded locally generated Prebuilt Rules packages of increasing size. The process for each test was:

  • create N rules with a set of fields and version X
  • package them into a zip file
  • upload it into the Serverless instance via the /api/fleet/epm/packages endpoint (a sketch of this step is shown after this list)
  • install all rules
  • recreate the N rules with a different set of fields (to generate diffs) and version X + 1
  • recreate the zip and upload it into the Serverless instance to trigger the upgrade workflow for all rules
  • update all rules by clicking the Update All Rules button
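A rough sketch of the package upload step, assuming Fleet's zip-upload endpoint accepts the archive as the request body; the host, credentials, and file path are placeholders:

```ts
import { readFile } from 'node:fs/promises';

// Uploads a locally built Prebuilt Rules package zip to a Kibana/Serverless instance.
const uploadPackage = async (kibanaUrl: string, zipPath: string, apiKey: string) => {
  const zip = await readFile(zipPath);
  const res = await fetch(`${kibanaUrl}/api/fleet/epm/packages`, {
    method: 'POST',
    headers: {
      'kbn-xsrf': 'true',
      'content-type': 'application/zip',
      authorization: `ApiKey ${apiKey}`,
    },
    body: zip,
  });
  if (!res.ok) {
    throw new Error(`Package upload failed: ${res.status} ${await res.text()}`);
  }
  return res.json();
};
```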

Results:

| Rules upgraded | Time taken by request (min) |
|---------|--------|
| 1500 | 1.2 |
| 1750 | 1.4 |
| 2000 | 1.6 |
| 2500 | 1.9 |
| 2750 | 2.2 |
| 3000 | 2.4 |
| 3500 | 2.9 |
| 4000 | 3.1 |
| 6000 | OOM |

So the results were even better than expected, given the previous testing running Kibana in production mode with limited memory (to mimic Serverless).

  • The request timeout of 60s, or even 2.5 minutes, does not apply in Serverless.
  • OOM issues only appeared at package sizes vastly greater than what Kibana will face in the medium or even long term, given the number of rule updates that are currently possible (~1000). Testing proved that a package update of 4000 rules is possible, which gives us more than enough buffer for the future.

APM profiling

There seems to be some kind of delay or issue with the /upgrade/_perform requests being reported in Production. They might become visible tomorrow (I've seen pretty big delays before), but there's also a chance that they are just not picked up correctly by the APM server.

LINK to Production APM, with filter for production project applied:

Anyway, since this PR ([Security Solution] /upgrade/_perform performance testing #197898) is not yet merged and released, the spans for the new blocking CPU-intensive logic will not be reported on the production APM server. Therefore, I created a PR Serverless deployment to get some insight into what these spans look like.

Project Id: 20d80f52fb1d5e448e918b9638d16c59d8
Span for a /upgrade/_perform transaction updating 3000 rules: LINK

(screenshot: APM transaction span for the 3000-rule upgrade)

The CPU-intensive task createPrebuiltRuleAssetsPayload took 2344 ms, which for a 3000-rule upgrade request is roughly 0.8 ms per rule. Testing shows this span increases linearly with the size of the payload, but it does not become a bottleneck for the request.

@jpdjere
Contributor Author

jpdjere commented Nov 5, 2024

/redeploy

@jpdjere jpdjere added the ci:project-redeploy Always create a new Cloud project label Nov 5, 2024
@jpdjere jpdjere requested a review from banderror November 5, 2024 23:22
@elasticmachine
Contributor

elasticmachine commented Nov 6, 2024

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

cc @jpdjere

Contributor

@xcrzx xcrzx left a comment


Thanks for the detailed performance testing, @jpdjere 🚀

The test results look good to me. My main concern was the potential for request timeouts during large rule upgrades, but the tests confirm this isn’t an issue. While the total time to upgrade 1,000 rules is a bit on the high side (~60 seconds), I don't think immediate action is necessary, given that rule upgrades should happen relatively infrequently—probably only a few times per month.

The CPU blocking time of ~700 ms per 1,000 rules also doesn’t seem likely to impact Kibana’s overall performance, especially since it’s an infrequent operation.

That said, as we discussed, it would be good to keep a couple of performance improvement items on the list to tackle after the initial release. Added a ticket to Milestone 4: #199101

@banderror banderror added v9.0.0 backport:version Backport to applied version labels v8.17.0 and removed backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) labels Nov 6, 2024
Contributor

@banderror banderror left a comment


The results look great! Now I feel quite confident about the performance of the upgrade workflow; it doesn't look like we're going to get timeouts or OOMs any time soon.

Also, thank you for posting the detailed description, sharing the links to dashboards and addressing the comments. LGTM 👍 🚀

@banderror banderror merged commit bdb6ff1 into elastic:main Nov 6, 2024
52 checks passed
@kibanamachine
Contributor

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/11703902858

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 6, 2024
@kibanamachine
Contributor

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Nov 6, 2024
…esting (#197898) (#199128)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Security Solution] `/upgrade/_perform` performance testing (#197898)](#197898)

mgadewoll pushed a commit to mgadewoll/kibana that referenced this pull request Nov 7, 2024