[release/v1.0] [prometheus.remote_write] Add troubleshooting steps for ooo errors (#1938)

Co-authored-by: Clayton Cornell <[email protected]>
Co-authored-by: Paulin Todev <[email protected]>
3 people authored Oct 21, 2024
1 parent 682fd3e commit 6ab9276
Showing 1 changed file with 31 additions and 0 deletions.
@@ -404,6 +404,37 @@ prometheus.remote_write "default" {
}
```

## Troubleshooting

### Out of order errors

You may sometimes see an "out of order" error in the {{< param "PRODUCT_NAME" >}} log files.
This means that {{< param "PRODUCT_NAME" >}} sent a metric sample that has an older timestamp than a sample that the database already ingested.
If your database is Mimir, the exact name of the [Mimir error][mimir-ooo-err] is `err-mimir-sample-out-of-order`.
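
As a rough illustration of the condition described above, the following Python sketch flags any sample whose timestamp is older than the newest timestamp already seen for the same series.
It's a simplified model and not how Mimir or {{< param "PRODUCT_NAME" >}} implement the check. The function name `find_out_of_order` and the `(series, timestamp)` tuples are hypothetical.

```python
def find_out_of_order(samples):
    """Return the samples that arrive with an older timestamp than the
    newest sample already accepted for the same series.

    `samples` is an iterable of (series, timestamp) tuples in arrival order.
    This is a simplified model for illustration only.
    """
    newest = {}    # series -> newest timestamp accepted so far
    rejected = []  # samples that would be reported as out of order
    for series, timestamp in samples:
        if series in newest and timestamp < newest[series]:
            rejected.append((series, timestamp))
        else:
            newest[series] = timestamp
    return rejected


# Hypothetical arrival order: one writer sends timestamps 100 and 160,
# then a second writer sends 130 for the same series.
arrivals = [("up", 100), ("up", 160), ("up", 130)]
late = find_out_of_order(arrivals)
```

In this model, the sample with timestamp `130` is flagged because a sample with timestamp `160` was already accepted for the same series.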

The most common cause for this error is that there is more than one {{< param "PRODUCT_NAME" >}} instance scraping the same target.
To troubleshoot, take the following steps in order:

1. If you use [clustering][], check if the number of {{< param "PRODUCT_NAME" >}} instances changed at the time the error was logged.
   This is the only situation in which it is normal to experience an out of order error.
   The error would only happen for a short period, until the cluster stabilizes and all {{< param "PRODUCT_NAME" >}} instances have a new list of targets.
   Since the cluster is expected to stabilize in much less time than the scrape interval, this isn't a real problem.
   If the out of order error you see is not related to scaling of clustered collectors, you must investigate it.
1. Check if there are active {{< param "PRODUCT_NAME" >}} instances which should not be running.
   There may be an older {{< param "PRODUCT_NAME" >}} instance that wasn't shut down before a new one was started.
1. Inspect the configuration to see if there could be multiple {{< param "PRODUCT_NAME" >}} instances which scrape the same target.
1. Inspect the WAL to see which {{< param "PRODUCT_NAME" >}} instance sent those metric samples.
   The WAL is located in a directory set by the [run command][run-cmd] `--storage.path` argument.
   You can use [Promtool][promtool] to inspect it and find out which metric series were sent by this {{< param "PRODUCT_NAME" >}} instance since the last WAL truncation event.
   For example:

   ```
   ./promtool tsdb dump --match='{__name__="otelcol_connector_spanmetrics_duration_seconds_bucket", http_method="GET", job="ExampleJobName"}' /path/to/wal/
   ```
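
The configuration check in the steps above amounts to looking for targets that appear under more than one instance.
The following Python sketch shows that idea on hand-written data. It doesn't read real {{< param "PRODUCT_NAME" >}} configuration, and the function name `find_shared_targets`, the instance names, and the target addresses are all hypothetical.

```python
from collections import defaultdict


def find_shared_targets(targets_by_instance):
    """Return a dict mapping each target scraped by more than one
    instance to the sorted list of instances that scrape it.

    `targets_by_instance` maps an instance name to the list of targets
    that you collected by hand from that instance's configuration.
    """
    owners = defaultdict(set)  # target -> set of instances scraping it
    for instance, targets in targets_by_instance.items():
        for target in targets:
            owners[target].add(instance)
    return {
        target: sorted(instances)
        for target, instances in owners.items()
        if len(instances) > 1
    }


# Hypothetical inventory of two instances and their scrape targets.
inventory = {
    "alloy-a": ["host1:9100", "host2:9100"],
    "alloy-b": ["host2:9100", "host3:9100"],
}
shared = find_shared_targets(inventory)
```

Any target in the result is a candidate source of out of order errors, because two instances may send samples for the same series.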

[clustering]: ../../configure/clustering
[mimir-ooo-err]: https://grafana.com/docs/mimir/latest/manage/mimir-runbooks/#err-mimir-sample-out-of-order
[run-cmd]: ../../cli/run
[promtool]: https://prometheus.io/docs/prometheus/latest/command-line/promtool/#promtool-tsdb

## Technical details

`prometheus.remote_write` uses [snappy][] for compression.