Perf regression, too many small bulk requests #13024
@StephanErb thanks for raising this. I've had a look at the code changes, and nothing obviously stands out. There were some changes in the ES output code, but they're not enabled in APM Server 8.13.x. We'll need to try and reproduce the issue and work back from there.
Just to clarify: are you referring to the number of documents successfully indexed in Elasticsearch?
APM Server has its own ES output implementation, completely independent of Elastic Agent/Beats.
We have upgraded the APM Server and Elastic, but the APM clients remained unchanged. There was thus no change in the data forwarded to APM Server and ultimately indexed into Elastic. Some stats on ingestion from stack monitoring:
Please note we are running a … When checking …
As per …, this would again explain why there was a big drop in IOPS once we moved over to …
Hi 👋 go-docappender (that's where the indexing logic is implemented) was bumped between those versions. I'd like to confirm something before going deeper:
As mentioned above, the APM Server has a custom ES output implementation, so it's not using Beats presets.
Thanks for picking this up. We just upgraded in the Cloud console without any additional configuration changes for the APM Server (flush interval, etc.) or APM Server memory changes. We are running two APM Servers in the max configuration (2x30 GB). This is what the load looks like as per Stack Monitoring (incl. request stats): Unfortunately I don't have enough history retention on the monitoring cluster to look at what happened around the time we upgraded. In our logs we see a relatively large number of indexing errors:
Could those errors lead to higher ingest load now that elastic/go-docappender#99 has landed?
Thank you for the follow-up! Are you using any sampling config? Tail-based sampling or head-based sampling?
That specific PR shouldn't impact the number of requests because per-document retries are not enabled in APM Server. It's possible for the ingest load to increase if APM Server receives specific status codes and the request retry code is executed. There are two retry mechanisms:
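For illustration, a minimal Go sketch of what status-code-based retries of whole bulk requests typically look like. This is not the actual go-docappender implementation; the function name and the exact set of status codes are assumptions.

```go
package main

import "fmt"

// shouldRetryRequest is a hypothetical helper illustrating status-code-based
// retry decisions for whole bulk requests. The exact codes retried on in
// go-docappender may differ; 429 and 5xx gateway/unavailable codes are
// common choices.
func shouldRetryRequest(statusCode int) bool {
	switch statusCode {
	case 429, 502, 503, 504:
		return true
	default:
		return false
	}
}

func main() {
	for _, code := range []int{200, 429, 500, 503} {
		fmt.Printf("status %d -> retry: %v\n", code, shouldRetryRequest(code))
	}
}
```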
@StephanErb it would be great if you could provide information on whether you have sampling enabled, and if so, whether it is tail-based or head-based sampling. A bugfix was rolled out in …
@simitt this is with:
Hello @StephanErb, apologies for the delay here. I wanted to ask for one clarification which I didn't see mentioned in the thread.
Yes, both got updated to the same version via a regular update initiated on Elastic Cloud.
I'm investigating this issue and I think I've found the bug. I'll create a PR urgently to fix this.
Updates the `FlushBytes` setting to default to 1 MiB and only override to 24 KB if the user has explicitly set it to 24 KB. Fixes #13024. Signed-off-by: Marc Lopez Rubio <[email protected]> (cherry-picked from commit a453a88 into the release branches).
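A rough Go sketch of the defaulting behaviour described by the fix, assuming a simplified configuration shape; the constant and function names here are illustrative and not APM Server's actual code.

```go
package main

import "fmt"

const (
	defaultFlushBytes = 1 << 20   // 1 MiB: the new default flush threshold
	legacyFlushBytes  = 24 * 1024 // 24 KB: only honoured if explicitly configured
)

// resolveFlushBytes is a hypothetical helper mirroring the described behaviour:
// an explicitly configured value (including 24 KB) is kept, anything else
// falls back to the 1 MiB default.
func resolveFlushBytes(configured int, explicitlySet bool) int {
	if explicitlySet && configured > 0 {
		return configured
	}
	return defaultFlushBytes
}

func main() {
	fmt.Println(resolveFlushBytes(0, false))               // 1048576 (default)
	fmt.Println(resolveFlushBytes(legacyFlushBytes, true)) // 24576 (explicit 24 KB kept)
}
```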
APM Server version (`apm-server version`): 8.13.2
Description of the problem including expected versus actual behavior:
I believe there is a regression in the number of bulk requests/s emitted by APM Server towards Elastic. I don't have all data for it, but I believe the issue started in 8.12.x and then got worse with the upgrade to 8.13.x.
This is from the Elastic cloud console: On otherwise identical clients/agents, we see that with the upgrade from 8.12 to 8.13 the number of indexing requests/s jumped noticeably.
The number of documents/s that get ingested didn't really change much, but our hot nodes started to run under higher load. Fortunately, we observed that setting `index.translog.durability=async` on all APM indices drops IOPS per hot node back to acceptable levels:

With a decently tuned `bulk_max_size` setting I'd not expect such a big drop in IOPS/s. I am thus wondering if the APM Server is misconfigured in newer versions. It almost feels like the server uses the agent defaults instead of its custom bulk size option? Was it maybe forgotten to set `preset: custom` in the server config once elastic/elastic-agent#3797 got merged into the agent?