Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perf regression, too many small bulk requests #13024

Closed
StephanErb opened this issue Apr 22, 2024 · 10 comments · Fixed by #13576
Closed

Perf regression, too many small bulk requests #13024

StephanErb opened this issue Apr 22, 2024 · 10 comments · Fixed by #13576
Assignees
Labels

Comments

@StephanErb
Copy link

StephanErb commented Apr 22, 2024

APM Server version (apm-server version):
8.13.2

Description of the problem including expected versus actual behavior:

I believe there is a regression in the number of bulk requests/s emitted by APM Server towards Elastic. I don't have all data for it, but I believe the issue started in 8.12.x and then got worse with the upgrade to 8.13.x.

This is from the Elastic cloud console: On otherwise identical clients/agents, we see that with the upgrade from 8.12 to 8.13 the number of indexing requests/s jumped noticeably.

Screenshot_2024-04-16_at_23 59 50

The number of documents/s that get ingested didn't really change much, but our hot nodes started to run under higher load. Fortunately, we observed that setting index.translog.durability=async on the all APM indices drops IOPS per hot node back to acceptable levels:

Screenshot_2024-04-21_at_21 35 36

On a decently tuned bulk_max_size setting I'd not expect such a big drop in IOPS/s. I am thus wondering if the APM server is misconfigured in newer versions.

It almost feels like the server uses the agent defaults instead of its custom bulk size option? Was it maybe forgotten to set preset: custom into the server config once elastic/elastic-agent#3797 got merged into the agent?

@StephanErb StephanErb added the bug label Apr 22, 2024
@StephanErb StephanErb changed the title Perf regression, too many requests small bulk requests Perf regression, too many small bulk requests Apr 22, 2024
@axw
Copy link
Member

axw commented Apr 23, 2024

@StephanErb thanks for raising this. I've had a look at the code changes, and nothing obviously stands out. There were some changes in the ES output code, but they're not enabled in APM Server 8.13.x. We'll need to try and reproduce the issue and work back from there.

The number of documents/s that get ingested didn't really change much

Just to clarify: are you referring to the number of documents successfully indexed in Elasticsearch?

It almost feels like the server uses the agent defaults instead of its custom bulk size option? Was it maybe forgotten to set preset: custom into the server config once elastic/elastic-agent#3797 got merged into the agent?

APM Server has its own ES output implementation, completely independent of Elastic Agent/Beats.

@StephanErb
Copy link
Author

StephanErb commented Apr 23, 2024

The number of documents/s that get ingested didn't really change much

Just to clarify: are you referring to the number of documents successfully indexed in Elasticsearch?

We have upgraded the APM Server and Elastic but the APM clients remained unchanged. There was such no change in data forwarded to APM Server and ultimately indexed into Elastic.

Some stats on ingestion from stack monitoring:

  • Ingestion of 20000 docs/s into .ds-metrics-apm.app.all-default-2024.04.23-004117
  • Ingestion of 4000 docs/s into .ds-metrics-apm.internal-default-2024.04.21-000931
  • Ingestion of 2000 docs/s into .ds-traces-apm-default-2024.04.23-001742
  • Ingestion of 1000 docs/s into .ds-logs-apm.app.all-default-2024.04.17-000162

Please note we are running a reroute pipeline to prevent per-service indices and instead have them all flow into one all datastream for all services to prevent a shard explosion.

When checking GET _tasks?pretty&human&detailed=true&actions=indices:data/write/bulk we see a few hundred parallel tasks such as:

"iaCFp3aFSnuliL3GpI_fnQ:1325096688": {
      "node": "iaCFp3aFSnuliL3GpI_fnQ",
      "id": 1325096688,
      "type": "transport",
      "action": "indices:data/write/bulk",
      "description": "requests[211], indices[metrics-apm <SNIP not to leak service names> traces-apm-default]",
      "start_time": "2024-04-23T20:26:19.872Z",
      "start_time_in_millis": 1713903979872,
      "running_time": "299.7ms",
      "running_time_in_nanos": 299780808,
      "cancellable": false,
      "headers": {
        "X-elastic-product-origin": "observability"
      }
},
"iaCFp3aFSnuliL3GpI_fnQ:1325097388": {
      "node": "iaCFp3aFSnuliL3GpI_fnQ",
      "id": 1325097388,
      "type": "transport",
      "action": "indices:data/write/bulk",
      "description": "requests[261], indices[metrics-apm.internal-default, metrics-apm.app <SNIP not to leak service names>]",
      "start_time": "2024-04-23T20:26:20.165Z",
      "start_time_in_millis": 1713903980165,
      "running_time": "61.3ms",
      "running_time_in_nanos": 61304826,
      "cancellable": false,
      "headers": {
        "X-elastic-product-origin": "observability"
    }
}

As per requests[211] and requests[261] my interpretation is we are seeing a batch size of ~200 documents which appears relatively low to me given the overall throughput of the system. (I hope this is correct, I could not find any docs on the format).

This would again be in-line why there was a big drop in IOPS once we moved over to index.translog.durability=async

@kruskall
Copy link
Member

kruskall commented Apr 29, 2024

Hi 👋

go-docappender (that's where the indexing logic is implemented) was bumped between those versions.
The diff of the changes is at elastic/go-docappender@c69a1cf...v1.0.0

I'd like to confirm something going deeper:

  • Can you confirm no config option was changed between versions ? Specifically flush interval, compression level, flush bytes, max requests and retry backoff settings.
  • Was there any change to the max memory assigned to APM Server ? (memory limits affect the max buffer size)
  • How is the number of request measured ? (there were some fixes to some specific metrics)

As mentioned above, the APM Server has a custom ES output implementation so it's not using beats presets.
One thing that could affect the number of request is status codes for the retry config.
We're retrying for 429, 502, 503, 504. Could you check if there is an increase in the number of failed requests or any specific status code ?

@StephanErb
Copy link
Author

Thanks for picking this up.

We just upgraded in the Cloud console without any additional configuration changes for the APM server (flush interval, etc) or APM server memory changes. We are running two APM servers in the max configuration (2x30 GB).

This is how the load looks like as per Stack Monitoring (incl request stats):
Screenshot 2024-04-29 at 12 45 11
Screenshot 2024-04-29 at 12 45 20

Unfortunately I don't have enough history retention on the monitoring cluster too look at what happened around the time we upgraded.

In our logs we see a relative large number of indexing errors:

10:23:19.114 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:366] object mapping for [kafka.consumer.fetch.manager.records.lag] tried to parse field [lag] as object, but found a concrete value
10:23:19.115 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:366] object mapping for [kafka.consumer.fetch.manager.records.lag] tried to parse field [lag] as object, but found a concrete value
10:23:19.116 agent.log [][access]   "POST /intake/v2/events HTTP/" 202
10:23:19.117 agent.log [][access]   "POST /intake/v2/events HTTP/" 202
10:23:19.118 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:366] object mapping for [kafka.consumer.fetch.manager.records.lag] tried to parse field [lag] as object, but found a concrete value
10:23:19.118 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:156] failed to parse field [hikaricp.connections] of type [double] in document with id 'yIhhKY8BTYD0o0NlX6M_'

Could those errors lead to higher-ingest load now that elastic/go-docappender#99 has landed?

@kruskall
Copy link
Member

Thank you for the followup!

Are you using any sampling config ? tail based sampling or head based sampling ?


Could those errors lead to higher-ingest load now that elastic/go-docappender#99 has landed?

That specific PR shouldn't impact the number of requests because per-document retries are not enabled in APM Server. It's possible for the ingest load to increase if APM Server receives specific status code and the request retry code is executed.

There are two retry logic:

  • document retries: this one retries single documents in a bulk request if they return specific status codes. This is not enabled in APM Server and cannot be enabled by config.
  • request retries: this one retries the HTTP request (bulk index) if ES responds with specific status codes (429, 502, 503, 504).

@simitt
Copy link
Contributor

simitt commented May 23, 2024

@StephanErb it would be great if you could provide the information on whether you have sampling enabled, and if so, whether it is tail based or head based sampling. A bugfix was rolled out in 8.13 that might explain the behaviour, but to confirm we would need more information about the sampling.

@StephanErb
Copy link
Author

@simitt this is with:

  • head-based sampling
  • sample rate of about 10%

@simitt simitt assigned marclop, carsonip and lahsivjar and unassigned marclop and kruskall May 23, 2024
@lahsivjar lahsivjar assigned vikmik and inge4pres and unassigned lahsivjar and carsonip Jun 13, 2024
@inge4pres
Copy link
Contributor

hello @StephanErb apologies for the delay here, I wanted to ask one clarification which I didn't see mentioned in the thread.
Did you also upgrade Elasticsearch to the same 8.13.x version as apm-server?

@StephanErb
Copy link
Author

Yes, both got updated to the same version via a regular update initiated on Elastic cloud.

@marclop marclop self-assigned this Jul 4, 2024
@marclop
Copy link
Contributor

marclop commented Jul 4, 2024

I'm investigating this issue and I think I've found the bug. I'll create a PR urgently to fix this.

marclop added a commit to marclop/apm-server that referenced this issue Jul 4, 2024
Updates the `FlushBytes` setting to default to 1 mib and only override
to 24kb if the user has explicitly set it to 24kb.

Fixes elastic#13024

Signed-off-by: Marc Lopez Rubio <[email protected]>
@marclop marclop closed this as completed in a453a88 Jul 4, 2024
mergify bot pushed a commit that referenced this issue Jul 4, 2024
Updates the `FlushBytes` setting to default to 1 mib and only override
to 24kb if the user has explicitly set it to 24kb.

Fixes #13024
---------

Signed-off-by: Marc Lopez Rubio <[email protected]>
(cherry picked from commit a453a88)

# Conflicts:
#	changelogs/head.asciidoc
#	internal/beater/beater.go
mergify bot pushed a commit that referenced this issue Jul 5, 2024
Updates the `FlushBytes` setting to default to 1 mib and only override
to 24kb if the user has explicitly set it to 24kb.

Fixes #13024
---------

Signed-off-by: Marc Lopez Rubio <[email protected]>
(cherry picked from commit a453a88)
mergify bot added a commit that referenced this issue Jul 5, 2024
Updates the `FlushBytes` setting to default to 1 mib and only override
to 24kb if the user has explicitly set it to 24kb.

Fixes #13024
---------

Signed-off-by: Marc Lopez Rubio <[email protected]>
(cherry picked from commit a453a88)

Co-authored-by: Marc Lopez Rubio <[email protected]>
marclop added a commit that referenced this issue Jul 5, 2024
Updates the `FlushBytes` setting to default to 1 mib and only override
to 24kb if the user has explicitly set it to 24kb.

Fixes #13024
---------

Signed-off-by: Marc Lopez Rubio <[email protected]>
(cherry picked from commit a453a88)
Signed-off-by: Marc Lopez Rubio <[email protected]>

# Conflicts:
#	changelogs/head.asciidoc
#	internal/beater/beater.go
marclop added a commit that referenced this issue Jul 5, 2024
…13577)

Updates the `FlushBytes` setting to default to 1 mib and only override
to 24kb if the user has explicitly set it to 24kb.

Fixes #13024
---------

Signed-off-by: Marc Lopez Rubio <[email protected]>
(cherry picked from commit a453a88)
Signed-off-by: Marc Lopez Rubio <[email protected]>

# Conflicts:
#	changelogs/head.asciidoc
#	internal/beater/beater.go

* fix conflicts

Signed-off-by: inge4pres <[email protected]>
Signed-off-by: Marc Lopez Rubio <[email protected]>

* lint

Signed-off-by: inge4pres <[email protected]>
Signed-off-by: Marc Lopez Rubio <[email protected]>

* fix dependency modified by IDE

Signed-off-by: inge4pres <[email protected]>
Signed-off-by: Marc Lopez Rubio <[email protected]>

* remove RequireDataStream

Signed-off-by: Marc Lopez Rubio <[email protected]>

---------

Signed-off-by: inge4pres <[email protected]>
Signed-off-by: Marc Lopez Rubio <[email protected]>
Co-authored-by: Marc Lopez Rubio <[email protected]>
Co-authored-by: inge4pres <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants