[APM] Internal Server Error returns when requesting trace sample of Elasticsearch dependency #195882

Open
ablnk opened this issue Oct 11, 2024 · 6 comments
Labels
apm:serverless apm bug Fixes for quality problems that affect the customer experience obs-serverless-performance Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team

Comments

@ablnk

ablnk commented Oct 11, 2024

Description:
The request GET /internal/apm/dependencies/charts/distribution?percentileThreshold=95&dependencyName=elasticsearch&spanName=<> fails with status code 500 and returns "search_phase_execution_exception Caused by: circuit_breaking_exception" when a trace sample is requested for the Elasticsearch dependency.

Error
search_phase_execution_exception Caused by: circuit_breaking_exception: [parent] Data too large, data for [<reduce_aggs>] would be [4130347322/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4130329600/3.8gb], new bytes reserved: [17722/17.3kb], usages [model_inference=0/0b, eql_sequence=0/0b, fielddata=0/0b, request=374915/366.1kb, inflight_requests=18532/18kb]; for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker-errors.html Root causes: task_cancelled_exception: task cancelled [Fatal failure during search: failed to merge result [[parent] Data too large, data for [<reduce_aggs>] would be [4130347322/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4130329600/3.8gb], new bytes reserved: [17722/17.3kb], usages [model_inference=0/0b, eql_sequence=0/0b, fielddata=0/0b, request=374915/366.1kb, inflight_requests=18532/18kb]; for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker-errors.html]] task_cancelled_exception: task cancelled [Fatal failure during search: failed to merge 
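The byte counts in the message can be checked mechanically. Below is a minimal sketch, assuming only the message format quoted above; the function name `parse_breaker_message` and the field names are hypothetical and not part of Elasticsearch or Kibana. It confirms that the "would be" figure is the real usage plus the newly reserved bytes, and that the real usage alone was already above the parent limit before the 17.3kb reservation for `<reduce_aggs>` was attempted.

```python
import re

# Excerpt of the error message quoted above (only the parts the parser needs).
MESSAGE = (
    "circuit_breaking_exception: [parent] Data too large, data for [<reduce_aggs>] "
    "would be [4130347322/3.8gb], which is larger than the limit of "
    "[4080218931/3.7gb], real usage: [4130329600/3.8gb], "
    "new bytes reserved: [17722/17.3kb]"
)


def parse_breaker_message(message: str) -> dict:
    """Extract the raw byte counts from a parent circuit breaker message.

    Hypothetical helper: the patterns only target the wording shown in this issue.
    """
    patterns = {
        "would_be": r"would be \[(\d+)/",
        "limit": r"limit of \[(\d+)/",
        "real_usage": r"real usage: \[(\d+)/",
        "new_bytes": r"new bytes reserved: \[(\d+)/",
    }
    values = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"field {name!r} not found in message")
        values[name] = int(match.group(1))
    return values


fields = parse_breaker_message(MESSAGE)

# Bytes by which the attempted reservation would exceed the limit.
overshoot = fields["would_be"] - fields["limit"]

# True when the heap was over the limit even before the new reservation.
already_over_limit = fields["real_usage"] > fields["limit"]

print(f"overshoot: {overshoot} bytes, already over limit: {already_over_limit}")
```

Read this way, the numbers suggest the reduce step itself asked for very little memory; the request was rejected because overall heap usage on the node was already past the parent breaker limit.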

The issue is only reproducible in a test LogsDB environment with search power set to 35.

Data in the search period:

| Data view | Docs |
| --- | --- |
| Logs | 2,732,575,843 |
| Metrics | 601,310,529 |
| Metrics - Kubernetes | 261,604,394 |
| APM | 1,005,652,761 |

Logs

Steps to reproduce:

  1. Go to Applications - Dependencies.
  2. Select Elasticsearch dependency, then go to "Operations" tab.
  3. Set the search period to Last 7 Days or larger.
  4. Select the most impactful operation.
  5. Verify that a trace sample is loaded.
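For reproducing outside the UI, the request named in the description can be assembled as in the sketch below. This is an illustration under assumptions: the base URL is a placeholder, the span name is elided in this issue and must be supplied by the caller, and the real request likely carries additional parameters (such as the time range) and authentication that are not shown here. The helper name `build_distribution_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Path taken from the issue description.
ENDPOINT = "/internal/apm/dependencies/charts/distribution"


def build_distribution_url(base_url: str, span_name: str) -> str:
    """Build the request URL using only the query parameters shown in this issue."""
    query = urlencode(
        {
            "percentileThreshold": 95,
            "dependencyName": "elasticsearch",
            "spanName": span_name,  # elided in the issue; supplied by the caller
        }
    )
    return f"{base_url.rstrip('/')}{ENDPOINT}?{query}"


# Placeholder host and span name, for illustration only.
url = build_distribution_url("https://kibana.example.test", "<span name>")
params = parse_qs(urlsplit(url).query)
print(url)
```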

Expected behavior:

Trace sample is loaded.

@ablnk ablnk added apm:serverless bug Fixes for quality problems that affect the customer experience labels Oct 11, 2024
@botelastic botelastic bot added the needs-team Issues missing a team label label Oct 11, 2024
@jughosta jughosta added the Team:APM All issues that need APM UI Team support label Oct 15, 2024
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:APM)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Oct 15, 2024
@smith smith added the Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team label Oct 15, 2024
@elasticmachine
Contributor

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

@smith smith added apm and removed Team:APM All issues that need APM UI Team support labels Oct 15, 2024
@ablnk
Author

ablnk commented Nov 18, 2024

Since the ES client timeout was increased, the "Request timed out" error no longer reproduces. However, the described scenario still doesn't work properly: in an environment with serverless.search.search_power_max: 35, I now get a circuit breaking exception (with the search period set to Last 30 Days):

Error
search_phase_execution_exception Caused by: circuit_breaking_exception: [parent] Data too large, data for [<reduce_aggs>] would be [4130347322/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4130329600/3.8gb], new bytes reserved: [17722/17.3kb], usages [model_inference=0/0b, eql_sequence=0/0b, fielddata=0/0b, request=374915/366.1kb, inflight_requests=18532/18kb]; for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker-errors.html Root causes: task_cancelled_exception: task cancelled [Fatal failure during search: failed to merge result [[parent] Data too large, data for [<reduce_aggs>] would be [4130347322/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4130329600/3.8gb], new bytes reserved: [17722/17.3kb], usages [model_inference=0/0b, eql_sequence=0/0b, fielddata=0/0b, request=374915/366.1kb, inflight_requests=18532/18kb]; for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker-errors.html]] task_cancelled_exception: task cancelled [Fatal failure during search: failed to merge 

In an environment with serverless.search.search_power_max: 45 this does not reproduce: the request completes with status 200 OK (the requested data is returned), but Kibana doesn't render the "waterfall" component. See the recording:
Image

@andrewvc
Contributor

@ablnk would it be reasonable to expect that, regardless of search power, we may still get timeouts with more data?

@chrisdistasio

I believe we've ruled out changing SP to 35 and are defaulting to a value of 45. @cachedout can you please confirm?

@ablnk
Author

ablnk commented Nov 19, 2024

@chrisdistasio this is more for awareness of what you may encounter when using SP35; it's not a candidate for a hotfix, since we're defaulting to SP45.
@andrewvc I think so. I haven't tested what happens if you set a really large period like 90 days, assuming this is not a common use case.


6 participants