[8.x] [Security Solution] `/upgrade/_perform` performance testing (#197898) #199128

Merged
merged 1 commit into from
Nov 6, 2024


kibanamachine
Contributor

Backport

This will backport the following commits from main to 8.x:

Questions?

Please refer to the Backport tool documentation

…197898)

## Summary

- Creates a new `withSyncSecuritySpan` wrapper to measure sync functions
in APM. Adds this wrapper to new CPU intensive logic in the
`/upgrade/_perform` endpoint.
- Performance-tests the endpoint. See results below.
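
A minimal sketch of what a synchronous span wrapper of this kind could look like. The `SpanLike` and `TracerLike` interfaces, the `withSyncSpan` name, and the recording tracer are hypothetical stand-ins so the example runs on its own; the actual `withSyncSecuritySpan` in Kibana may be implemented differently.

```typescript
// Hypothetical minimal shapes standing in for an APM agent.
// These are assumptions for illustration, not the Kibana or APM agent API.
interface SpanLike {
  end(): void;
}
interface TracerLike {
  startSpan(name: string, type: string): SpanLike | null;
}

// Runs a synchronous function inside a span and always ends the span,
// even when the function throws.
function withSyncSpan<T>(tracer: TracerLike, name: string, fn: () => T): T {
  const span = tracer.startSpan(name, "security");
  try {
    return fn();
  } finally {
    span?.end();
  }
}

// Toy tracer that records the names of ended spans so the sketch is standalone.
function createRecordingTracer(): { tracer: TracerLike; finished: string[] } {
  const finished: string[] = [];
  const tracer: TracerLike = {
    startSpan: (name) => ({
      end: () => {
        finished.push(name);
      },
    }),
  };
  return { tracer, finished };
}

const { tracer, finished } = createRecordingTracer();
const total = withSyncSpan(tracer, "getUpgradeableRules", () =>
  [1, 2, 3].reduce((a, b) => a + b, 0)
);
console.log(total, finished); // 6 [ 'getUpgradeableRules' ]
```

The `try`/`finally` keeps the span from leaking when the wrapped CPU-intensive function throws, which matters for sync code where there is no promise to hook into.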

## Performance testing

### Possible OOMs in production Serverless

Created a Serverless instance and manually installed different Prebuilt
Rules packages to force rule upgrades.
- With the current published prebuilt packages, a user cannot update
more than 950 rules with a single request.
- This number is expected to grow, but at a slower pace than the actual
number of rules being published.
- Also, as users start customizing rules, rules with conflicts will be
excluded from bulk requests, which will **make payloads even smaller.**
- Testing the biggest possible upgrade request, Serverless behaved
reliably and no **timeouts** or **OOMs** occurred:

| From version   | To version | Rule Updates | Request time   |
|---------|--------|---------|--------|
| 8.9.9   | 8.15.9 | 913     | 47.3s  |
| 8.9.12  | 8.15.9 | 917     | 52.34s |
| 8.9.15  | 8.15.9 | 928     | 56.08s |
| 8.10.4  | 8.15.9 | 872     | 43.29s |
| 8.10.5  | 8.15.9 | 910     | 52.21s |
| 8.10.6  | 8.15.9 | 913     | 55.92s |
| 8.10.7  | 8.15.9 | 924     | 49.89s |
| 8.11.2  | 8.15.9 | 910     | 56.48s |
| 8.11.5  | 8.15.9 | 928     | 49.22s |
| 8.11.16 | 8.15.9 | 695     | 38.91s |
| 8.12.6  | 8.15.9 | 947     | 51.13s |
| 8.13.11 | 8.15.9 | 646     | 42.98s |

- Given the positive results for much bigger payloads seen in the
**Memory profiling with limited memory in Kibana production mode**
below, we can assume that there's no risk of OOMs in Serverless at the
moment.
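
As a rough way to compare the rows of the Serverless table, the request times can be normalised per upgraded rule. The sketch below only re-uses the numbers from the table; the per-rule figure is a derived convenience metric and not something the tests measured directly.

```typescript
// Measurements copied from the Serverless table above (all upgrading to 8.15.9).
const measurements = [
  { from: "8.9.9", ruleUpdates: 913, seconds: 47.3 },
  { from: "8.9.12", ruleUpdates: 917, seconds: 52.34 },
  { from: "8.9.15", ruleUpdates: 928, seconds: 56.08 },
  { from: "8.10.4", ruleUpdates: 872, seconds: 43.29 },
  { from: "8.10.5", ruleUpdates: 910, seconds: 52.21 },
  { from: "8.10.6", ruleUpdates: 913, seconds: 55.92 },
  { from: "8.10.7", ruleUpdates: 924, seconds: 49.89 },
  { from: "8.11.2", ruleUpdates: 910, seconds: 56.48 },
  { from: "8.11.5", ruleUpdates: 928, seconds: 49.22 },
  { from: "8.11.16", ruleUpdates: 695, seconds: 38.91 },
  { from: "8.12.6", ruleUpdates: 947, seconds: 51.13 },
  { from: "8.13.11", ruleUpdates: 646, seconds: 42.98 },
];

// Milliseconds spent per upgraded rule for each request.
const perRule = measurements.map((m) => ({
  from: m.from,
  msPerRule: (m.seconds * 1000) / m.ruleUpdates,
}));

// Slowest request and largest payload in the table.
const slowest = measurements.reduce((a, b) => (b.seconds > a.seconds ? b : a));
const largest = measurements.reduce((a, b) => (b.ruleUpdates > a.ruleUpdates ? b : a));

for (const row of perRule) {
  console.log(`${row.from}: ${row.msPerRule.toFixed(1)} ms/rule`);
}
console.log(`slowest: ${slowest.from} (${slowest.seconds}s)`);
console.log(`largest: ${largest.from} (${largest.ruleUpdates} rules)`);
```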

### Memory profiling with limited memory in Kibana production mode

- Launched Kibana in Production mode, and set a **memory limit of
700mb** to mimic as closely as possible the Serverless environment
(where memory is a hard constraint)
- Stress-tested with a large number of requests and saw the following
behaviour:

| Rule Updates | Request time (min) | OOM error? | Metrics |
|---------|--------|--------|--------|
| 1500 | 1.1 | No | <details><summary>Unfold</summary>![image](https://github.com/user-attachments/assets/46303a1a-a929-4c00-8777-8d1f23face17)</details> |
| 2000 | 1.5 | No | <details><summary>Unfold</summary>![image](https://github.com/user-attachments/assets/bd33d259-50fd-42df-947d-3a2e7c5c78c3)</details> |
| 2500 | 1.8 | No | <details><summary>Unfold</summary>![image](https://github.com/user-attachments/assets/9145d2e7-e87c-4ba6-8633-7fe1087c29fb)</details> |
| 2750 | 1.9 | No | <details><summary>Unfold</summary>![image](https://github.com/user-attachments/assets/9009163e-f58d-4be3-8a1f-87844760a037)</details> |
| 3000 | - | YES | |

- Rule upgrades consistently run out of memory (OOM) when the payload is
>= 3000 rules, but behave reliably below that. This leaves a good enough
buffer for growth of the Prebuilt Rules package.
- Also, the saw-toothed shape of the heap usage graphs shows that
garbage collection works properly for payloads under 3000 rules.
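
The saw-tooth reading of the heap graphs can also be checked programmatically. The sketch below counts sharp drops in a series of heap-usage samples; the `countGcDrops` helper, the 20% threshold, and both sample series are hypothetical and invented for illustration (they are not data from these tests). In a Node.js process, real samples could be collected from `process.memoryUsage().heapUsed`.

```typescript
// Counts how many times heap usage falls by at least `minDropRatio` between
// consecutive samples; each such fall is treated as a garbage-collection drop.
function countGcDrops(heapUsedMb: number[], minDropRatio = 0.2): number {
  let drops = 0;
  for (let i = 1; i < heapUsedMb.length; i++) {
    const prev = heapUsedMb[i - 1];
    const curr = heapUsedMb[i];
    if (prev > 0 && (prev - curr) / prev >= minDropRatio) {
      drops++;
    }
  }
  return drops;
}

// Hypothetical samples in MB, invented for illustration only.
const sawTooth = [200, 350, 500, 250, 400, 550, 260, 420]; // memory is reclaimed
const climbing = [200, 300, 400, 500, 600, 650, 690]; // memory is never reclaimed

console.log(countGcDrops(sawTooth)); // 2
console.log(countGcDrops(climbing)); // 0
```

A series with no drops while approaching the memory limit would be the pattern to worry about; repeated drops match the saw-tooth shape described above.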

### APM request profiling

- Connected Kibana in production mode to an APM server to measure spans
of the `/upgrade/_perform` request.
- Additionally, measured new CPU-intensive logic which calculates rule
diffs and creates rule payloads for upgrades.
- An example span for a successful upgrade of 2500 rules:
<img width="1722" alt="image"
src="https://github.com/user-attachments/assets/07aa3079-5ce4-4b87-ab41-2a3e133316ef">
- The new spans for CPU-intensive tasks `getUpgradeableRules` and
`createModifiedPrebuiltRuleAssets`, which are displayed as `blocking`,
have an acceptable span length, and do not have a considerable overhead
on the total length of the request.

### Timeout testing

- Tested Kibana with `--no-base-path` in order to check for potential
timeouts in long running requests (ESS Cloud proxy is supposed to have a
request timeout config of 2.5 mins)
- Still
[confirming](https://elastic.slack.com/archives/C5UDAFZQU/p1730297621776729)
with Kibana Operations the behaviour of the timeouts in ESS and
Serverless envs.
- Tested with mock rules (indexed directly to ES) and **no timeouts
occurred**:

| Rule Updates   | Request time (min) |
|---------|--------|
| 2000  | 2.1 |
| 2000  | 2.1 |
| 2250  | 2.3 |
| 2500  | 2.6 |
| 3000  | 3.1 |
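
Since the proxy timeout is still being confirmed, one way to read the table is to compare each measured request time against the 2.5 minute figure mentioned above. The sketch below does only that comparison; `PROXY_TIMEOUT_MIN` is an assumption taken from the "supposed to have" value in this description, not a confirmed setting.

```typescript
// Request times (minutes) copied from the timeout table above.
const runs = [
  { ruleUpdates: 2000, minutes: 2.1 },
  { ruleUpdates: 2000, minutes: 2.1 },
  { ruleUpdates: 2250, minutes: 2.3 },
  { ruleUpdates: 2500, minutes: 2.6 },
  { ruleUpdates: 3000, minutes: 3.1 },
];

// Assumed proxy timeout in minutes; unconfirmed at the time of writing.
const PROXY_TIMEOUT_MIN = 2.5;

// Runs whose duration would exceed the assumed proxy timeout.
const overBudget = runs.filter((r) => r.minutes > PROXY_TIMEOUT_MIN);

console.log(overBudget.map((r) => r.ruleUpdates)); // [ 2500, 3000 ]
```

Under that assumption, payloads of 2500 rules and above are the ones to re-check once the real timeout values for ESS and Serverless are known.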

### Conclusion

The results show that the `/upgrade/_perform` endpoint performs reliably
under stress, given the current expected request payloads.

The only question to triple-check is the behaviour of server request
timeouts in Serverless: I'm waiting for the Kibana ops team to get back
to me, even though testing here did not show cases of timeouts.

---------

Co-authored-by: Elastic Machine <[email protected]>
Co-authored-by: Dmitrii Shevchenko <[email protected]>
(cherry picked from commit bdb6ff1)
@kibanamachine kibanamachine merged commit cbaad5a into elastic:8.x Nov 6, 2024
39 checks passed
@elasticmachine
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #52 / task_manager task management scheduled at "before all" hook for "sets scheduledAt to runAt if retryAt is null"

Metrics [docs]

✅ unchanged

cc @jpdjere


3 participants