Intake v2 request timeout causes inconsistent log and self-instrumented trace #14232
hello @up2neck can you please give a bit more information on the environment?
The log entry was taken from the APM Server logs (stdout, collected by GCP Logging, if it matters). Our team has 2 environments, both facing the issue. My thought is that some middleware probably rewrites the exact error, which leads to the incorrect log entry.
Thanks for the additional info 🙏🏼 The reason I was asking is that you may have transforms or ingest pipelines in Elasticsearch that remap fields, in this case the
I am not aware of anything inside the apm-server binary modifying the value of this field, hence my confusion as to why it is different in the logs than in what's stored in ES as a trace.
We have a few additional steps on top of the default "APM" integration-provided ingest pipelines, but none of them interact with
@up2neck I am back looking into the code for the reported discrepancy between the log entry and the trace entry. The message "request timed out" and the corresponding 503 HTTP status code come from the HTTP request processing itself; there is no rewrite in the logger. I also noticed that the request did time out, matching the log event
trace event
Since I noticed you have Tail-Based Sampling enabled, would you please share the TBS configuration?
@inge4pres
mhmhm ok...
that
@inge4pres that's also set for head-based sampled transactions.
@inge4pres
@up2neck let's recap what we have here. Since I am unable to replicate the reported behavior in an Elastic Cloud ESS instance or locally, please confirm my understanding below is correct. A Java client app (client in the sense of an APM client agent) sends an event to APM Server with an HTTP request that times out at ~3.5 seconds. Is it plausible that the connection from APM Server to Elasticsearch being interrupted exercises this behavior? I am still puzzled as to what's actually happening, to be honest, because the log message reports a duration smaller than the trace's by ~500 microseconds. log event
trace event
The
You said you fetched the trace document directly from ES, which makes sense in the scenario that I am imagining: the trace event is processed by APM in 3.5 seconds, a 4xx is returned to the client, ingestion into Elasticsearch times out or the connection is truncated, a log message is produced for the same trace, and the event lands in ES anyway. Do you have telemetry in your Kubernetes environment that is able to spot network connections being reset as part of pod-to-pod communication?
@up2neck thanks for the bug report. I managed to reproduce it.
apm-server request log
{"log.level":"error","@timestamp":"2024-12-07T00:18:06.391Z","log.logger":"request","log.origin":{"function":"github.com/elastic/apm-server/internal/beater/api.apmMiddleware.LogMiddleware.func1.1","file.name":"middleware/log_middleware.go","file.line":59},"message":"request timed out","service.name":"apm-server","url.original":"/intake/v2/events","http.request.method":"POST","user_agent.original":"curl/8.5.0","source.address":"127.0.0.1","trace.id":"7d605e1eeeed169c7580ea71d20cfa3c","transaction.id":"7d605e1eeeed169c","http.request.id":"7d605e1eeeed169c","event.duration":266192314,"http.request.body.bytes":303694,"http.response.status_code":503,"error.message":"request timed out","ecs.version":"1.6.0"}
apm-server self-instrumentation trace document
{
"_index": ".ds-traces-apm-default-2024.12.06-000001",
"_id": "qU57npMBrE4DWVdURCBD",
"_version": 1,
"_score": 0,
"_source": {
"agent": {
"name": "go",
"version": "2.6.0"
},
"process": {
"args": [
"/home/carson/.cache/JetBrains/GoLand2024.3/tmp/GoLand/___main_local_issue_14232",
"-e",
"-c",
"apm-server-issue-14232.yml"
],
"parent": {
"pid": 21362
},
"pid": 1096402,
"title": "___main_local_i"
},
"source": {
"ip": "127.0.0.1"
},
"processor": {
"event": "transaction"
},
"url": {
"path": "/intake/v2/events",
"scheme": "http",
"port": 8200,
"domain": "localhost",
"full": "http://localhost:8200/intake/v2/events"
},
"observer": {
"hostname": "carson-elastic",
"type": "apm-server",
"version": "8.14.3"
},
"trace": {
"id": "7d605e1eeeed169c7580ea71d20cfa3c"
},
"@timestamp": "2024-12-07T00:18:06.125Z",
"data_stream": {
"namespace": "default",
"type": "traces",
"dataset": "apm"
},
"service": {
"node": {
"name": "carson-elastic"
},
"name": "apm-server",
"runtime": {
"name": "gc",
"version": "go1.23.0"
},
"language": {
"name": "go",
"version": "go1.23.0"
},
"version": "8.14.3"
},
"host": {
"hostname": "carson-elastic",
"os": {
"platform": "linux"
},
"name": "carson-elastic",
"architecture": "amd64"
},
"client": {
"ip": "127.0.0.1"
},
"http": {
"request": {
"headers": {
"Accept": [
"*/*"
],
"User-Agent": [
"curl/8.5.0"
],
"Content-Length": [
"303694"
],
"Content-Type": [
"application/x-ndjson"
]
},
"method": "POST"
},
"response": {
"headers": {
"X-Content-Type-Options": [
"nosniff"
],
"Connection": [
"Close"
],
"Content-Type": [
"application/json"
]
},
"status_code": 400
},
"version": "1.1"
},
"event": {
"success_count": 1,
"outcome": "success"
},
"transaction": {
"result": "HTTP 4xx",
"duration": {
"us": 266341
},
"representative_count": 1,
"name": "POST /intake/v2/events",
"id": "7d605e1eeeed169c",
"span_count": {
"dropped": 0,
"started": 11
},
"type": "request",
"sampled": true
},
"user_agent": {
"original": "curl/8.5.0",
"name": "curl",
"device": {
"name": "Other"
},
"version": "8.5.0"
},
"span": {
"id": "7d605e1eeeed169c"
},
"timestamp": {
"us": 1733530686125111
}
},
"fields": {
"http.request.headers.Content-Length": [
"303694"
],
"transaction.name.text": [
"POST /intake/v2/events"
],
"http.request.headers.Accept": [
"*/*"
],
"http.response.headers.Connection": [
"Close"
],
"transaction.representative_count": [
1
],
"user_agent.original.text": [
"curl/8.5.0"
],
"process.parent.pid": [
21362
],
"host.hostname": [
"carson-elastic"
],
"process.pid": [
1096402
],
"service.language.name": [
"go"
],
"transaction.result": [
"HTTP 4xx"
],
"process.title.text": [
"___main_local_i"
],
"transaction.id": [
"7d605e1eeeed169c"
],
"http.request.method": [
"POST"
],
"processor.event": [
"transaction"
],
"source.ip": [
"127.0.0.1"
],
"agent.name": [
"go"
],
"host.name": [
"carson-elastic"
],
"user_agent.version": [
"8.5.0"
],
"http.response.status_code": [
400
],
"http.version": [
"1.1"
],
"event.outcome": [
"success"
],
"user_agent.original": [
"curl/8.5.0"
],
"transaction.duration.us": [
266341
],
"service.runtime.version": [
"go1.23.0"
],
"span.id": [
"7d605e1eeeed169c"
],
"client.ip": [
"127.0.0.1"
],
"user_agent.name": [
"curl"
],
"data_stream.type": [
"traces"
],
"host.architecture": [
"amd64"
],
"timestamp.us": [
1733530686125111
],
"url.path": [
"/intake/v2/events"
],
"observer.type": [
"apm-server"
],
"observer.version": [
"8.14.3"
],
"agent.version": [
"2.6.0"
],
"transaction.name": [
"POST /intake/v2/events"
],
"process.title": [
"___main_local_i"
],
"service.node.name": [
"carson-elastic"
],
"url.scheme": [
"http"
],
"transaction.sampled": [
true
],
"trace.id": [
"7d605e1eeeed169c7580ea71d20cfa3c"
],
"event.success_count": [
1
],
"transaction.span_count.dropped": [
0
],
"url.port": [
8200
],
"http.request.headers.Content-Type": [
"application/x-ndjson"
],
"url.full": [
"http://localhost:8200/intake/v2/events"
],
"http.request.headers.User-Agent": [
"curl/8.5.0"
],
"service.name": [
"apm-server"
],
"data_stream.namespace": [
"default"
],
"service.runtime.name": [
"gc"
],
"process.args": [
"/home/carson/.cache/JetBrains/GoLand2024.3/tmp/GoLand/___main_local_issue_14232",
"-e",
"-c",
"apm-server-issue-14232.yml"
],
"observer.hostname": [
"carson-elastic"
],
"url.full.text": [
"http://localhost:8200/intake/v2/events"
],
"transaction.type": [
"request"
],
"transaction.span_count.started": [
11
],
"@timestamp": [
"2024-12-07T00:18:06.125Z"
],
"service.version": [
"8.14.3"
],
"host.os.platform": [
"linux"
],
"data_stream.dataset": [
"apm"
],
"http.response.headers.Content-Type": [
"application/json"
],
"http.response.headers.X-Content-Type-Options": [
"nosniff"
],
"service.language.version": [
"go1.23.0"
],
"url.domain": [
"localhost"
],
"user_agent.device.name": [
"Other"
]
}
}
Notice the same trace id 7d605e1eeeed169c7580ea71d20cfa3c in both the log entry and the trace document. The setup I use to reproduce this is:
The effect of this is that while the request from curl will be seen and logged as a 503 by apm-server, because the client (curl) terminated the connection, the actual response, had the client stayed alive, would have been a 400 because of a line exceeding the max_event_size limit. As suspected in the initial bug report, this is caused by the intake v2 endpoint middleware order and response code manipulation. As for the actual scenario that caused what @up2neck observed, we'll need to dig deeper for a more plausible root cause. I have only tried with max_event_size, which may not be realistic, as the Java agent might not send such a large event and apm-server has a high enough default max_event_size.
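For illustration, a minimal Go sketch of a client exercising this kind of failure mode might look like the following. This is not the reproduction setup referenced above; the endpoint URL, payload shape, padding size, and timeout are assumptions, and whether the race is actually hit depends on timing and on the server's configured max_event_size.

// Illustrative sketch only (not the reproduction setup referenced above):
// a client that sends an intake v2 payload containing an oversized event
// line and gives up before apm-server responds, so the server sees the
// connection disappear while it is still handling the request.
package main

import (
	"bytes"
	"context"
	"log"
	"net/http"
	"strings"
	"time"
)

func main() {
	var body bytes.Buffer
	// Metadata line required by the intake v2 protocol.
	body.WriteString(`{"metadata":{"service":{"name":"repro","agent":{"name":"manual","version":"0.0.1"}}}}` + "\n")
	// One event line padded well past the (assumed) max_event_size, so the
	// server would answer 400 if the client waited for the response.
	pad := strings.Repeat("x", 400*1024)
	body.WriteString(`{"transaction":{"id":"0123456789abcdef","trace_id":"0123456789abcdef0123456789abcdef","name":"` + pad + `","type":"request","duration":1,"span_count":{"started":0}}}` + "\n")

	// Abandon the request after a short timeout so the connection is torn
	// down while apm-server may still be processing it.
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, "http://localhost:8200/intake/v2/events", &body)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/x-ndjson")

	if resp, err := http.DefaultClient.Do(req); err != nil {
		log.Printf("client gave up, as intended: %v", err)
	} else {
		defer resp.Body.Close()
		log.Printf("server responded before the timeout: %s", resp.Status)
	}
}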
I've updated the above comment to avoid some confusion.
The issue is now clear and should be ready for the team to move forward with a fix.
@carsonip @inge4pres
@up2neck thanks a lot for that 🙏🏼
@inge4pres raised some questions about my explanation and we managed to get to the bottom of it. I've crossed out some incorrect parts in my previous comment. I had a wrong assumption about the actual middleware order, although I managed to reproduce the problem correctly. The real issue here is that the 4xx result had already been set by the time the timeout was handled; in other words, the timeout middleware is missing the corresponding update to that result. @inge4pres and I are working on a fix.
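To make the ordering point concrete, here is a minimal, self-contained Go sketch. It is not apm-server's actual middleware code; the names and structure are invented for illustration. It shows how a timeout-aware logging layer and the status recorded for the handler can disagree when the timeout path never overwrites the result the handler already set.

// Minimal sketch (not apm-server's actual code) of a logging wrapper whose
// timeout path logs 503 without updating the recorded handler result.
package main

import (
	"log"
	"net/http"
)

// statusRecorder captures the status code the handler writes, playing the
// role of the result stored by self-instrumentation.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// withTimeoutLogging logs a 503 "request timed out" when the request context
// ends (client gone or deadline exceeded), but never rewrites rec.status,
// so the recorded result keeps whatever the handler set (e.g. 400).
func withTimeoutLogging(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)

		if req.Context().Err() != nil {
			// The logged status (503) diverges from the recorded one.
			log.Printf("request timed out: logged=503 recorded=%d", rec.status)
			return
		}
		log.Printf("request completed: recorded=%d", rec.status)
	})
}

func main() {
	// Handler standing in for the intake endpoint: it rejects an oversized
	// payload with a 400, as intake would for a line above max_event_size.
	intake := http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		http.Error(w, `{"error":"event exceeded max_event_size"}`, http.StatusBadRequest)
	})

	http.Handle("/intake/v2/events", withTimeoutLogging(intake))
	log.Fatal(http.ListenAndServe(":8200", nil))
}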
Great news! Thank you for the detailed explanation; it's very useful for understanding some parts of APM Server and how they work.
@carsonip
@1pkg
APM Server log events and the correlated trace (with self-instrumentation enabled) contain 2 different HTTP errors:
While the log event contains a 503 error with a "request timed out" message:
Log entry
the APM interface shows the actual error was:
Trace body