
Intake v2 request timeout causes inconsistent log and self-instrumented trace #14232

Closed
up2neck opened this issue Oct 2, 2024 · 19 comments · Fixed by #15123

@up2neck
Contributor

up2neck commented Oct 2, 2024

The APM Server log event and the correlated trace (with self-instrumentation enabled) contain two different HTTP errors:

The log event contains a 503 error with a "request timed out" message:

Log entry
{
  "message": "request timed out",
  "url.original": "/intake/v2/events",
  "http.request.body.bytes": 0,
  "log.level": "error",
  "transaction.id": "d91bb495a6ab25bb",
  "http.request.method": "POST",
  "ecs.version": "1.6.0",
  "http.response.status_code": 503,
  "error.message": "request timed out",
  "log.logger": "request",
  "@timestamp": "2024-10-02T07:22:04.439Z",
  "service.name": "apm-server",
  "user_agent.original": "apm-agent-java/1.45.0 (autocode-app 2.0.0)",
  "log.origin": {
    "function": "github.com/elastic/apm-server/internal/beater/api.apmMiddleware.LogMiddleware.func1.1",
    "file.name": "middleware/log_middleware.go",
    "file.line": 59
  },
  "event.duration": 3499790781,
  "http.request.id": "d91bb495a6ab25bb",
  "source.address": "000.000.000.000",
  "trace.id": "d91bb495a6ab25bbdede8859c7cf8c8a"
}

The APM interface shows the actual error was:

Image

Trace body
{
  "_index": ".ds-traces-apm-dummy_sub2-2024.09.20-000009",
  "_id": "urkbTJIBO4jYgZRmiL6k",
  "_version": 1,
  "_score": 0,
  "_source": {
    "agent": {
      "name": "go",
      "version": "2.6.0"
    },
    "process": {
      "args": [
        "apm-server",
        "run",
        "-e",
        "-c",
        "config/config-secret/apm-server.yml"
      ],
      "pid": 1,
      "title": "apm-server"
    },
    "source": {
      "nat": {
        "ip": "00.000.00.000"
      },
      "ip": "000.000.000.000"
    },
    "processor": {
      "event": "transaction"
    },
    "url": {
      "path": "/intake/v2/events",
      "scheme": "https",
      "domain": "mydomain",
      "full": "https://mydomain/intake/v2/events"
    },
    "labels": {
      "project": "dummy-sub2"
    },
    "cloud": {
      "availability_zone": "europe-west3-a",
      "instance": {
        "name": "gke-dummy-sub-elastic-europe-w-generic-c4e5a328-8ipl",
        "id": "6772592666386103052"
      },
      "provider": "gcp",
      "project": {
        "id": "or2-ms-dummy-sub-elastic-t1iylu"
      },
      "region": "europe-west3"
    },
    "edp": {
      "domain": "mydomain"
    },
    "observer": {
      "hostname": "apm-server-v2-apm-server-6f6db466bd-22ccb",
      "type": "apm-server",
      "version": "8.14.3"
    },
    "trace": {
      "id": "d91bb495a6ab25bbdede8859c7cf8c8a"
    },
    "@timestamp": "2024-10-02T07:22:00.940Z",
    "data_stream": {
      "namespace": "dummy_sub2",
      "type": "traces",
      "dataset": "apm"
    },
    "service": {
      "node": {
        "name": "apm-server-v2-apm-server-6f6db466bd-22ccb"
      },
      "environment": "sandbox",
      "name": "apm-server",
      "runtime": {
        "name": "gc",
        "version": "go1.22.7"
      },
      "language": {
        "name": "go",
        "version": "go1.22.7"
      },
      "version": "8.14.3"
    },
    "host": {
      "hostname": "apm-server-v2-apm-server-6f6db466bd-22ccb",
      "os": {
        "platform": "linux"
      },
      "name": "apm-server-v2-apm-server-6f6db466bd-22ccb",
      "architecture": "amd64"
    },
    "client": {
      "geo": {
        "region_iso_code": "BE-BRU",
        "continent_name": "Europe",
        "city_name": "Brussels",
        "country_iso_code": "BE",
        "country_name": "Belgium",
        "location": {
          "lon": 4.347,
          "lat": 50.8534
        },
        "region_name": "Brussels Capital"
      },
      "ip": "000.000.000.000"
    },
    "http": {
      "request": {
        "headers": {
          "X-Cloud-Trace-Context": [
            "61ce7649b7546eb12e48bd1b05dc5acf/7453092485412124311"
          ],
          "Cache-Control": [
            "no-cache"
          ],
          "Accept": [
            "*/*"
          ],
          "Content-Encoding": [
            "deflate"
          ],
          "X-Forwarded-Proto": [
            "https"
          ],
          "User-Agent": [
            "apm-agent-java/1.45.0 (autocode-app 2.0.0)"
          ],
          "X-Forwarded-For": [
            "000.000.000.000,00.00.00.000"
          ],
          "Pragma": [
            "no-cache"
          ],
          "Via": [
            "1.1 google"
          ],
          "Content-Type": [
            "application/x-ndjson"
          ]
        },
        "method": "POST",
        "cookies_list": [],
        "cookies": {}
      },
      "response": {
        "headers": {
          "X-Content-Type-Options": [
            "nosniff"
          ],
          "Connection": [
            "Close"
          ],
          "Content-Type": [
            "application/json"
          ]
        },
        "status_code": 400
      },
      "version": "1.1"
    },
    "event": {
      "agent_id_status": "missing",
      "ingested": "2024-10-02T07:22:05Z",
      "success_count": 1,
      "outcome": "success"
    },
    "transaction": {
      "result": "HTTP 4xx",
      "duration": {
        "us": 3499844
      },
      "representative_count": 1,
      "name": "POST /intake/v2/events",
      "id": "d91bb495a6ab25bb",
      "span_count": {
        "dropped": 0,
        "started": 0
      },
      "type": "request",
      "sampled": true
    },
    "user_agent": {
      "original": "apm-agent-java/1.45.0 (autocode-app 2.0.0)",
      "name": "Other",
      "device": {
        "name": "Other"
      }
    },
    "span": {
      "id": "d91bb495a6ab25bb"
    },
    "timestamp": {
      "us": 1727853720940194
    }
  },
  "fields": {
    "transaction.name.text": [
      "POST /intake/v2/events"
    ],
    "http.request.headers.Accept": [
      "*/*"
    ],
    "http.response.headers.Connection": [
      "Close"
    ],
    "transaction.representative_count": [
      1
    ],
    "user_agent.original.text": [
      "apm-agent-java/1.45.0 (autocode-app 2.0.0)"
    ],
    "client.geo.country_iso_code": [
      "BE"
    ],
    "host.hostname": [
      "apm-server-v2-apm-server-6f6db466bd-22ccb"
    ],
    "process.pid": [
      1
    ],
    "service.language.name": [
      "go"
    ],
    "transaction.result": [
      "HTTP 4xx"
    ],
    "cloud.availability_zone": [
      "europe-west3-a"
    ],
    "process.title.text": [
      "apm-server"
    ],
    "transaction.id": [
      "d91bb495a6ab25bb"
    ],
    "http.request.headers.Via": [
      "1.1 google"
    ],
    "http.request.method": [
      "POST"
    ],
    "processor.event": [
      "transaction"
    ],
    "labels.project": [
      "dummy-sub2"
    ],
    "source.ip": [
      "000.000.000.000"
    ],
    "agent.name": [
      "go"
    ],
    "host.name": [
      "apm-server-v2-apm-server-6f6db466bd-22ccb"
    ],
    "client.geo.country_name": [
      "Belgium"
    ],
    "event.agent_id_status": [
      "missing"
    ],
    "http.response.status_code": [
      400
    ],
    "http.version": [
      "1.1"
    ],
    "event.outcome": [
      "success"
    ],
    "user_agent.original": [
      "apm-agent-java/1.45.0 (autocode-app 2.0.0)"
    ],
    "cloud.region": [
      "europe-west3"
    ],
    "transaction.duration.us": [
      3499844
    ],
    "service.runtime.version": [
      "go1.22.7"
    ],
    "span.id": [
      "d91bb495a6ab25bb"
    ],
    "edp.domain": [
      "mydomain"
    ],
    "client.ip": [
      "000.000.000.000"
    ],
    "user_agent.name": [
      "Other"
    ],
    "data_stream.type": [
      "traces"
    ],
    "host.architecture": [
      "amd64"
    ],
    "http.request.headers.X-Forwarded-For": [
      "000.000.000.000,00.00.00.000"
    ],
    "cloud.provider": [
      "gcp"
    ],
    "timestamp.us": [
      1727853720940194
    ],
    "url.path": [
      "/intake/v2/events"
    ],
    "observer.type": [
      "apm-server"
    ],
    "observer.version": [
      "8.14.3"
    ],
    "agent.version": [
      "2.6.0"
    ],
    "transaction.name": [
      "POST /intake/v2/events"
    ],
    "process.title": [
      "apm-server"
    ],
    "client.geo.continent_name": [
      "Europe"
    ],
    "http.request.headers.Cache-Control": [
      "no-cache"
    ],
    "http.request.headers.Content-Encoding": [
      "deflate"
    ],
    "service.node.name": [
      "apm-server-v2-apm-server-6f6db466bd-22ccb"
    ],
    "http.request.headers.X-Forwarded-Proto": [
      "https"
    ],
    "url.scheme": [
      "https"
    ],
    "source.nat.ip": [
      "00.000.00.000"
    ],
    "transaction.sampled": [
      true
    ],
    "cloud.instance.id": [
      "6772592666386103052"
    ],
    "trace.id": [
      "d91bb495a6ab25bbdede8859c7cf8c8a"
    ],
    "client.geo.region_name": [
      "Brussels Capital"
    ],
    "client.geo.location": [
      {
        "coordinates": [
          4.347,
          50.8534
        ],
        "type": "Point"
      }
    ],
    "event.success_count": [
      1
    ],
    "transaction.span_count.dropped": [
      0
    ],
    "http.request.headers.Content-Type": [
      "application/x-ndjson"
    ],
    "url.full": [
      "https://mydomain/intake/v2/events"
    ],
    "http.request.headers.User-Agent": [
      "apm-agent-java/1.45.0 (autocode-app 2.0.0)"
    ],
    "service.environment": [
      "sandbox"
    ],
    "client.geo.city_name": [
      "Brussels"
    ],
    "client.geo.region_iso_code": [
      "BE-BRU"
    ],
    "service.name": [
      "apm-server"
    ],
    "data_stream.namespace": [
      "dummy_sub2"
    ],
    "service.runtime.name": [
      "gc"
    ],
    "process.args": [
      "apm-server",
      "run",
      "-e",
      "-c",
      "config/config-secret/apm-server.yml"
    ],
    "observer.hostname": [
      "apm-server-v2-apm-server-6f6db466bd-22ccb"
    ],
    "http.request.headers.X-Cloud-Trace-Context": [
      "61ce7649b7546eb12e48bd1b05dc5acf/7453092485412124311"
    ],
    "url.full.text": [
      "https://mydomain/intake/v2/events"
    ],
    "transaction.type": [
      "request"
    ],
    "transaction.span_count.started": [
      0
    ],
    "event.ingested": [
      "2024-10-02T07:22:05.000Z"
    ],
    "http.request.headers.Pragma": [
      "no-cache"
    ],
    "@timestamp": [
      "2024-10-02T07:22:00.940Z"
    ],
    "service.version": [
      "8.14.3"
    ],
    "host.os.platform": [
      "linux"
    ],
    "data_stream.dataset": [
      "apm"
    ],
    "http.response.headers.Content-Type": [
      "application/json"
    ],
    "http.response.headers.X-Content-Type-Options": [
      "nosniff"
    ],
    "service.language.version": [
      "go1.22.7"
    ],
    "url.domain": [
      "mydomain"
    ],
    "user_agent.device.name": [
      "Other"
    ],
    "cloud.instance.name": [
      "gke-dummy-sub-elastic-europe-w-generic-c4e5a328-8ipl"
    ],
    "cloud.project.id": [
      "or2-ms-dummy-sub-elastic-t1iylu"
    ]
  }
}
@up2neck up2neck added the bug label Oct 2, 2024
@inge4pres
Contributor

Hello @up2neck, can you please give a bit more information on the environment?
I see from the trace that you're running APM Server v8.14.3; is Elasticsearch deployed at the same version?
Are the trace body and log body you posted from documents stored in Elasticsearch?

@up2neck
Contributor Author

up2neck commented Oct 2, 2024

@inge4pres

The log entry was taken from APM Server logs (stdout, collected by GCP Logging, if it matters).
The trace was taken from Elasticsearch itself.

Our team has two environments (both facing the issue):
GCP Load balancer -> APM Server (GKE, ECK) -> Elasticsearch cluster (GKE, ECK)
GCP Load balancer -> APM Server (GKE, ECK) -> Elasticsearch cluster (Elastic Cloud)

My thought is that some middleware probably rewrites the exact error, which leads to the incorrect log entry.

@inge4pres
Contributor

Thanks for the additional info 🙏🏼

The reason I was asking is that you may have transforms or ingest pipelines in Elasticsearch that remap fields, in this case the http.response.status_code field.
Do you know if there is special handling of this field?

My thought is that some middleware probably rewrites the exact error, which leads to the incorrect log entry.

I am not aware of anything inside the apm-server binary that modifies the value of this field, hence my confusion as to why it differs in the logs from what's stored in ES as the trace.
I'll verify this is a bug and report back.

@up2neck
Contributor Author

up2neck commented Oct 3, 2024

@inge4pres

We have a few additional steps on top of the default APM integration-provided ingest pipelines, but none of them interact with the http.response.status_code field; it is kept as-is.

@inge4pres inge4pres self-assigned this Oct 3, 2024
@inge4pres inge4pres removed their assignment Oct 17, 2024
@inge4pres inge4pres self-assigned this Oct 28, 2024
@inge4pres
Contributor

@up2neck I am back looking into the code for the reported discrepancy between the log entry and the trace entry.
What I see is that the logger is fed a response.errors.timeout keyword, and then we log the result as recorded by the HTTP request:

https://github.com/inge4pres/apm-server/blob/9c7439f30167dcd7d4bd44e0473b0f249ce5fa8d/internal/beater/middleware/log_middleware.go#L101

So the message "request timed out" and the corresponding 503 HTTP status code come from the HTTP request processing itself; there is no rewrite in the logger.
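
A minimal sketch of that behavior, with hypothetical type and field names (not the actual apm-server code): the logging middleware simply echoes whatever status code and error were recorded earlier on the request context.

package main

import (
    "errors"
    "log"
    "net/http"
    "net/http/httptest"
)

// requestContext is a hypothetical, simplified stand-in for the context object
// that apm-server threads through its middleware chain.
type requestContext struct {
    req        *http.Request
    statusCode int
    err        error
}

// logMiddleware only echoes whatever result was recorded by the time the wrapped
// handler chain returns; it does not derive or rewrite the status code itself.
func logMiddleware(next func(*requestContext)) func(*requestContext) {
    return func(c *requestContext) {
        next(c)
        if c.err != nil {
            log.Printf("message=%q http.response.status_code=%d url.original=%s",
                c.err.Error(), c.statusCode, c.req.URL.Path)
        }
    }
}

func main() {
    c := &requestContext{req: httptest.NewRequest(http.MethodPost, "/intake/v2/events", nil)}

    // Some earlier stage (e.g. a timeout middleware) already recorded this result;
    // the logger just reports it as-is.
    handler := logMiddleware(func(c *requestContext) {
        c.statusCode = http.StatusServiceUnavailable
        c.err = errors.New("request timed out")
    })
    handler(c)
}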

I also noticed that the request did time out, matching the event.duration and the trace content.

log event

  "event.duration": 3499790781,

trace event

    "transaction": {
      "result": "HTTP 4xx",
      "duration": {
        "us": 3499844
      },

Since it looks like you have Tail-Based Sampling enabled, would you please share the TBS configuration?
I am leaning towards the trace event being mixed up, rather than the log event being wrong.

@carsonip carsonip self-assigned this Nov 6, 2024
@up2neck
Contributor Author

up2neck commented Nov 6, 2024

@inge4pres
One interesting thing is that we don't use any TBS configuration in our manifests - APM Server is deployed with the ECK operator, with only the following config customization:

    config:
      apm-server:
        auth:
          anonymous:
            allow_agent:
            - rum-js
            - rum-js-dpeo
            - js-base
            - java
            - dotnet
            - php
            - opentelemetry/cpp
            - python
            - otlp
            - go
            - opentelemetry
            - opentelemetry/webjs
            - opentelemetry/js
            - opentelemetry/go
            - opentelemetry/java
            - opentelemetry/nodejs
            - opentelemetry/dotnet
            - nodejs
            - '@microlabs/otel-workers-sdk/js'
            enabled: true
            rate_limit:
              event_limit: 8000
              ip_limit: 1000
          api_key:
            enabled: false
            limit: 100
        capture_personal_data: true
        default_service_environment: undefined
        expvar.enabled: false
        host: 0.0.0.0:8200
        idle_timeout: 45s
        max_connections: 0
        max_event_size: 307200
        max_header_size: 1048576
        pprof.enabled: false
        read_timeout: 600s
        rum:
          allow_headers:
          - x-requested-with
          - access-control-request-private-network
          - access-control-allow-origin
          - xmlhttprequest
          - request-origin
          allow_origins:
          - '*'
          enabled: true
          exclude_from_grouping: ^/webpack
          library_pattern: node_modules|bower_components|~
        shutdown_timeout: 30s
        ssl:
          supported_protocols:
          - TLSv1.2
          - TLSv1.3
        write_timeout: 30s
      logging.level: warning
      monitoring.elasticsearch: {}

@inge4pres
Contributor

mhmhm ok...
I thought TBS was on because I saw this in the trace body

    "transaction": {
      "result": "HTTP 4xx",
      "duration": {
        "us": 3499844
      },
      "representative_count": 1,
      "name": "POST /intake/v2/events",
      "id": "d91bb495a6ab25bb",
      "span_count": {
        "dropped": 0,
        "started": 0
      },
      "type": "request",
      "sampled": true
    },

that "sampled": true should be signaling sampling is on, but honestly I am not sure.

@axw
Member

axw commented Nov 7, 2024

@inge4pres that's set also for head-based sampled transactions.

@up2neck
Contributor Author

up2neck commented Nov 18, 2024

@inge4pres
Got it, I will disable sampling and check the log and trace correlation.

@inge4pres
Contributor

inge4pres commented Dec 2, 2024

@up2neck let's recap what we have here.

Since I am unable to replicate the reported behavior either in an Elastic Cloud ESS instance or locally, please confirm that my understanding below is correct.

A Java client app (client in the sense of APM client agent) sends an event into APM Server with an HTTP request that times out at ~3.5 seconds.
The timeout happens when APM Server is processing the request inside the /intake/v2/events handler, but we still get a valid trace stored in Elasticsearch.
This could be due to the request context expiring or being canceled while Elasticsearch is processing the document: the APM Server reports a failure, but the document is actually stored by ES at a later time.
The trace ID reported in the trace event (generated by the Java client) is persisted in Elasticsearch, together with the original HTTP status code 4xx. Normally, this event would not be logged as an error; it is logged here because an error occurred while processing the request.
The logged event reports a different status code than the original trace event (503 in the log event vs 400 in the trace event).

Is it plausible that the connection from APM Server to Elasticsearch being interrupted triggers this behavior?
My assumption is that the trace represents the reply from the APM Server HTTP server to the Java client, whereas the log message represents the processing of the event in the handler, which failed due to the timeout.

I am still puzzled as to what's actually happening, to be honest, because the log message reports a duration smaller than the trace by ~50 microseconds.

log event

  "event.duration": 3499790781,

trace event

    "event": {
      "agent_id_status": "missing",
      "ingested": "2024-10-02T07:22:05Z",
      "success_count": 1,
      "outcome": "success"
    },
    "transaction": {
      "result": "HTTP 4xx",
      "duration": {
        "us": 3499844
      },

The event.outcome field is set to "success", and event.ingested is the time the event was flushed to Elasticsearch ("ingested": "2024-10-02T07:22:05Z",), but the log message is prior to that ("@timestamp": "2024-10-02T07:22:04.439Z").
The trace event correctly reports its own timestamp ("@timestamp": "2024-10-02T07:22:00.940Z").

You said you fetched the trace document directly from ES, which makes sense in the scenario I am imagining: the trace event is processed by APM Server in 3.5 seconds, a 4xx is returned to the client, ingestion into Elasticsearch times out or the connection is truncated, a log message is produced for the same trace, and the event lands in ES anyway.

Do you have telemetry in your Kubernetes environment that is able to spot network connections being reset as part of pod-to-pod communication?
I honestly don't think we have an issue in apm-server, as the logged message and the trace body are for two different events, AFAICT.

@carsonip
Member

carsonip commented Dec 7, 2024

@up2neck thanks for the bug report. I managed to reproduce it.

apm-server request log

{"log.level":"error","@timestamp":"2024-12-07T00:18:06.391Z","log.logger":"request","log.origin":{"function":"github.com/elastic/apm-server/internal/beater/api.apmMiddleware.LogMiddleware.func1.1","file.name":"middleware/log_middleware.go","file.line":59},"message":"request timed out","service.name":"apm-server","url.original":"/intake/v2/events","http.request.method":"POST","user_agent.original":"curl/8.5.0","source.address":"127.0.0.1","trace.id":"7d605e1eeeed169c7580ea71d20cfa3c","transaction.id":"7d605e1eeeed169c","http.request.id":"7d605e1eeeed169c","event.duration":266192314,"http.request.body.bytes":303694,"http.response.status_code":503,"error.message":"request timed out","ecs.version":"1.6.0"}

apm-server self-instrumentation trace document

{
  "_index": ".ds-traces-apm-default-2024.12.06-000001",
  "_id": "qU57npMBrE4DWVdURCBD",
  "_version": 1,
  "_score": 0,
  "_source": {
    "agent": {
      "name": "go",
      "version": "2.6.0"
    },
    "process": {
      "args": [
        "/home/carson/.cache/JetBrains/GoLand2024.3/tmp/GoLand/___main_local_issue_14232",
        "-e",
        "-c",
        "apm-server-issue-14232.yml"
      ],
      "parent": {
        "pid": 21362
      },
      "pid": 1096402,
      "title": "___main_local_i"
    },
    "source": {
      "ip": "127.0.0.1"
    },
    "processor": {
      "event": "transaction"
    },
    "url": {
      "path": "/intake/v2/events",
      "scheme": "http",
      "port": 8200,
      "domain": "localhost",
      "full": "http://localhost:8200/intake/v2/events"
    },
    "observer": {
      "hostname": "carson-elastic",
      "type": "apm-server",
      "version": "8.14.3"
    },
    "trace": {
      "id": "7d605e1eeeed169c7580ea71d20cfa3c"
    },
    "@timestamp": "2024-12-07T00:18:06.125Z",
    "data_stream": {
      "namespace": "default",
      "type": "traces",
      "dataset": "apm"
    },
    "service": {
      "node": {
        "name": "carson-elastic"
      },
      "name": "apm-server",
      "runtime": {
        "name": "gc",
        "version": "go1.23.0"
      },
      "language": {
        "name": "go",
        "version": "go1.23.0"
      },
      "version": "8.14.3"
    },
    "host": {
      "hostname": "carson-elastic",
      "os": {
        "platform": "linux"
      },
      "name": "carson-elastic",
      "architecture": "amd64"
    },
    "client": {
      "ip": "127.0.0.1"
    },
    "http": {
      "request": {
        "headers": {
          "Accept": [
            "*/*"
          ],
          "User-Agent": [
            "curl/8.5.0"
          ],
          "Content-Length": [
            "303694"
          ],
          "Content-Type": [
            "application/x-ndjson"
          ]
        },
        "method": "POST"
      },
      "response": {
        "headers": {
          "X-Content-Type-Options": [
            "nosniff"
          ],
          "Connection": [
            "Close"
          ],
          "Content-Type": [
            "application/json"
          ]
        },
        "status_code": 400
      },
      "version": "1.1"
    },
    "event": {
      "success_count": 1,
      "outcome": "success"
    },
    "transaction": {
      "result": "HTTP 4xx",
      "duration": {
        "us": 266341
      },
      "representative_count": 1,
      "name": "POST /intake/v2/events",
      "id": "7d605e1eeeed169c",
      "span_count": {
        "dropped": 0,
        "started": 11
      },
      "type": "request",
      "sampled": true
    },
    "user_agent": {
      "original": "curl/8.5.0",
      "name": "curl",
      "device": {
        "name": "Other"
      },
      "version": "8.5.0"
    },
    "span": {
      "id": "7d605e1eeeed169c"
    },
    "timestamp": {
      "us": 1733530686125111
    }
  },
  "fields": {
    "http.request.headers.Content-Length": [
      "303694"
    ],
    "transaction.name.text": [
      "POST /intake/v2/events"
    ],
    "http.request.headers.Accept": [
      "*/*"
    ],
    "http.response.headers.Connection": [
      "Close"
    ],
    "transaction.representative_count": [
      1
    ],
    "user_agent.original.text": [
      "curl/8.5.0"
    ],
    "process.parent.pid": [
      21362
    ],
    "host.hostname": [
      "carson-elastic"
    ],
    "process.pid": [
      1096402
    ],
    "service.language.name": [
      "go"
    ],
    "transaction.result": [
      "HTTP 4xx"
    ],
    "process.title.text": [
      "___main_local_i"
    ],
    "transaction.id": [
      "7d605e1eeeed169c"
    ],
    "http.request.method": [
      "POST"
    ],
    "processor.event": [
      "transaction"
    ],
    "source.ip": [
      "127.0.0.1"
    ],
    "agent.name": [
      "go"
    ],
    "host.name": [
      "carson-elastic"
    ],
    "user_agent.version": [
      "8.5.0"
    ],
    "http.response.status_code": [
      400
    ],
    "http.version": [
      "1.1"
    ],
    "event.outcome": [
      "success"
    ],
    "user_agent.original": [
      "curl/8.5.0"
    ],
    "transaction.duration.us": [
      266341
    ],
    "service.runtime.version": [
      "go1.23.0"
    ],
    "span.id": [
      "7d605e1eeeed169c"
    ],
    "client.ip": [
      "127.0.0.1"
    ],
    "user_agent.name": [
      "curl"
    ],
    "data_stream.type": [
      "traces"
    ],
    "host.architecture": [
      "amd64"
    ],
    "timestamp.us": [
      1733530686125111
    ],
    "url.path": [
      "/intake/v2/events"
    ],
    "observer.type": [
      "apm-server"
    ],
    "observer.version": [
      "8.14.3"
    ],
    "agent.version": [
      "2.6.0"
    ],
    "transaction.name": [
      "POST /intake/v2/events"
    ],
    "process.title": [
      "___main_local_i"
    ],
    "service.node.name": [
      "carson-elastic"
    ],
    "url.scheme": [
      "http"
    ],
    "transaction.sampled": [
      true
    ],
    "trace.id": [
      "7d605e1eeeed169c7580ea71d20cfa3c"
    ],
    "event.success_count": [
      1
    ],
    "transaction.span_count.dropped": [
      0
    ],
    "url.port": [
      8200
    ],
    "http.request.headers.Content-Type": [
      "application/x-ndjson"
    ],
    "url.full": [
      "http://localhost:8200/intake/v2/events"
    ],
    "http.request.headers.User-Agent": [
      "curl/8.5.0"
    ],
    "service.name": [
      "apm-server"
    ],
    "data_stream.namespace": [
      "default"
    ],
    "service.runtime.name": [
      "gc"
    ],
    "process.args": [
      "/home/carson/.cache/JetBrains/GoLand2024.3/tmp/GoLand/___main_local_issue_14232",
      "-e",
      "-c",
      "apm-server-issue-14232.yml"
    ],
    "observer.hostname": [
      "carson-elastic"
    ],
    "url.full.text": [
      "http://localhost:8200/intake/v2/events"
    ],
    "transaction.type": [
      "request"
    ],
    "transaction.span_count.started": [
      11
    ],
    "@timestamp": [
      "2024-12-07T00:18:06.125Z"
    ],
    "service.version": [
      "8.14.3"
    ],
    "host.os.platform": [
      "linux"
    ],
    "data_stream.dataset": [
      "apm"
    ],
    "http.response.headers.Content-Type": [
      "application/json"
    ],
    "http.response.headers.X-Content-Type-Options": [
      "nosniff"
    ],
    "service.language.version": [
      "go1.23.0"
    ],
    "url.domain": [
      "localhost"
    ],
    "user_agent.device.name": [
      "Other"
    ]
  }
}

Notice the same trace id 7d605e1eeeed169c7580ea71d20cfa3c.

The setup I use to reproduce this is

  • local elasticsearch 8.14.3
    • with 1g heap, to slow indexing
  • local apm-server 8.14.3
    • self instrumentation enabled
    • create an indexing bottleneck in apm-server by setting output.elasticsearch.flush_bytes to a low value, e.g. 1b, where a min of 24kb will kick in
    • low max_event_size e.g. 10000
  • a lot of go agents sending high traffic to apm-server
  • a curl command to the intake v2 endpoint with a low --max-time timeout and a payload containing an oversized event in the middle, e.g.
curl --location 'http://localhost:8200/intake/v2/events' --header 'Content-Type: application/x-ndjson' --data-binary "@events2.ndjson" -v --max-time 1

The effect of this is that, while the request from curl is seen and logged as a 503 by apm-server because the client (curl) terminated the connection, the actual response, if the client were still connected, would be a 400, because of a line exceeding the max_event_size limit.
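
For anyone trying to reproduce this, a minimal sketch of how a payload like events2.ndjson could be generated, with a single NDJSON line far above the low max_event_size in the middle of otherwise small events. The event shapes and the file name here are illustrative assumptions, not the exact payload used above, and may need tweaking to pass intake validation.

package main

import (
    "fmt"
    "os"
    "strings"
)

func main() {
    f, err := os.Create("events2.ndjson")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // Intake v2 streams start with a metadata line.
    fmt.Fprintln(f, `{"metadata":{"service":{"name":"repro","agent":{"name":"go","version":"0.0.0"}}}}`)

    for i := 0; i < 200; i++ {
        name := "GET /ok"
        if i == 100 {
            // One oversized event in the middle: this single line exceeds the
            // low max_event_size (10000 bytes) used in the reproduction setup.
            name = strings.Repeat("x", 20000)
        }
        // Small, otherwise valid transaction events around it.
        fmt.Fprintf(f,
            `{"transaction":{"id":"%016x","trace_id":"%032x","name":"%s","type":"request","duration":1,"span_count":{"started":0}}}`+"\n",
            i, i, name)
    }
}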

As suspected in the initial bug report, this is caused by the intake v2 endpoint middleware order and response code manipulation.

As for the actual scenario that caused what @up2neck observed, we'll need to dig deeper for a more plausible root cause. I have only tried with max_event_size, which may not be realistic, as the Java agent might not send such a large event and apm-server has a high enough default max_event_size.

@carsonip carsonip self-assigned this Dec 7, 2024
@carsonip
Member

carsonip commented Dec 9, 2024

I've updated the above comment to avoid some confusion.

The TL;DR of the issue is that, because of ES backpressure, the APM agent request times out and the agent terminates the connection. apm-server's TimeoutMiddleware sees the ctx as canceled, LogMiddleware logs it as a 503, and writeStreamResult changes it to a 400 because of an invalid line. Therefore, the self-instrumentation tracer thinks the response is 400, while it is logged as 503 and nothing is actually returned to the agent.

The issue is now clear, and the team should be ready to move forward with a fix.

@carsonip carsonip changed the title Confusing log events for invalid requests Intake v2 request timeout causes inconsistent log and self-instrumented trace Dec 9, 2024
@up2neck
Contributor Author

up2neck commented Dec 13, 2024

@carsonip @inge4pres
I collected some more examples of inconsistent logs and traces. Unfortunately, most of the payload is either obfuscated or removed due to the company privacy policy. Even so, I hope it is useful.

evidence_sanitized.zip

@inge4pres
Contributor

I collected some more examples of inconsistent logs and traces.

@up2neck thanks a lot for that 🙏🏼

@carsonip
Member

carsonip commented Dec 30, 2024

@inge4pres raised some questions about my explanation, and we managed to get to the bottom of it. I've crossed out some incorrect parts in my previous comment. I had a wrong assumption about the actual middleware order, although I managed to reproduce the problem correctly.

The real issue here is that the 4xx is set (by setting c.Result) and sent to the client (which may or may not still be connected) using c.WriteResult() inside writeResult, and is therefore intercepted by apmhttp (the Elastic APM Go agent used for self-instrumentation). On the other hand, the timeout and log middlewares read and write c.Result but never call c.WriteResult() to send out the modified result. Together, these add up to the discrepancy in status code, where the status code apparently managed to "skip" some middlewares.

In other words, the timeout middleware is missing a c.WriteResult call, but even if it existed, it would conflict with the c.WriteResult inside the intake v2 handler's writeStreamResult.

Image
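
To make this concrete, here is a heavily simplified, hypothetical sketch of the interplay described above (assumed names, not the actual apm-server types or middleware chain): the intake handler both stores and writes a 400, while the timeout middleware later overwrites the stored result with a 503 but never writes it, so the tracer wrapping the ResponseWriter records 400 while the request log reports 503.

package main

import (
    "context"
    "fmt"
    "net/http"
    "net/http/httptest"
)

// reqContext is a hypothetical, simplified stand-in for apm-server's request
// context: it stores a result and can write it to the client exactly once.
type reqContext struct {
    w          http.ResponseWriter
    statusCode int
    written    bool
}

// WriteResult sends the currently stored status code to the client. The Elastic
// APM Go agent (apmhttp) wraps the ResponseWriter, so whatever is written here
// is what the self-instrumented transaction records.
func (c *reqContext) WriteResult() {
    if c.written {
        return
    }
    c.written = true
    c.w.WriteHeader(c.statusCode)
}

// writeStreamResult mirrors what the intake handler does in this scenario: it
// both stores and writes the 4xx it decided on (e.g. for an oversized line).
func writeStreamResult(c *reqContext) {
    c.statusCode = http.StatusBadRequest
    c.WriteResult() // the tracer sees 400 here
}

// timeoutMiddleware mirrors the problematic behavior: on context cancellation it
// overwrites the stored result with a 503 but never calls WriteResult, so only
// later readers of the stored result see the 503. Adding a WriteResult call here
// would conflict with the one already performed inside writeStreamResult.
func timeoutMiddleware(ctx context.Context, c *reqContext, next func(*reqContext)) {
    next(c)
    if ctx.Err() != nil {
        c.statusCode = http.StatusServiceUnavailable
    }
}

// logMiddleware reports whatever the stored result is after everything else ran.
func logMiddleware(c *reqContext, next func(*reqContext)) {
    next(c)
    fmt.Printf("request log: http.response.status_code=%d\n", c.statusCode) // 503
}

func main() {
    rec := httptest.NewRecorder()
    c := &reqContext{w: rec}

    ctx, cancel := context.WithCancel(context.Background())
    cancel() // simulate the agent disconnecting before the handler finishes

    logMiddleware(c, func(c *reqContext) {
        timeoutMiddleware(ctx, c, writeStreamResult)
    })
    fmt.Println("status written to the wire (seen by the tracer):", rec.Code) // 400
}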

@inge4pres and I are working on a fix.

@carsonip carsonip assigned carsonip and unassigned endorama Dec 30, 2024
@up2neck
Contributor Author

up2neck commented Jan 1, 2025

@carsonip

Great news! Thank you for the detailed explanation; it's very useful for understanding some parts of APM Server and how they work.

@up2neck
Contributor Author

up2neck commented Jan 6, 2025

@carsonip
A thought crossed my mind - could such behavior also affect APM Server metrics? It could potentially have worse consequences for end users if the metrics aren't accurate.

@1pkg
Member

1pkg commented Jan 7, 2025

Hello @up2neck, for your reference, the fix for this issue has been merged in #15123 and should be shipped as part of the next patch release.

@up2neck
Contributor Author

up2neck commented Jan 7, 2025

@1pkg
Hi! Thank you, I'm looking forward to the next release.
