[release/v1.0] [prometheus.remote_write] Add troubleshooting steps for ooo errors (#1938)

Co-authored-by: Clayton Cornell <[email protected]>
Co-authored-by: Paulin Todev <[email protected]>
grafana · Oct 21, 2024 · 6ab9276
1 parent 682fd3e
commit 6ab9276
Showing 1 changed file with 31 additions and 0 deletions.
diff --git a/docs/sources/reference/components/prometheus/prometheus.remote_write.md b/docs/sources/reference/components/prometheus/prometheus.remote_write.md
@@ -404,6 +404,37 @@ prometheus.remote_write "default" {
 }
 ```
 
+## Troubleshooting
+
+### Out of order errors
+
+You may sometimes see an "out of order" error in the {{< param "PRODUCT_NAME" >}} log files.
+This means that {{< param "PRODUCT_NAME" >}} sent a metric sample with an older timestamp than a sample that the database already ingested for the same series.
+If your database is Mimir, the exact name of the [Mimir error][mimir-ooo-err] is `err-mimir-sample-out-of-order`.
+
+The most common cause of this error is that more than one {{< param "PRODUCT_NAME" >}} instance scrapes the same target.
+To troubleshoot, take the following steps in order:
+1. If you use [clustering][], check whether the number of {{< param "PRODUCT_NAME" >}} instances changed at the time the error was logged.
+   This is the only situation in which it is normal to experience an out of order error.
+   The error should only happen for a short period, until the cluster stabilizes and all {{< param "PRODUCT_NAME" >}} instances have a new list of targets.
+   Because the cluster is expected to stabilize in much less time than the scrape interval, this isn't a real problem.
+   If the out of order error you see isn't related to scaling of clustered collectors, you must investigate it.
+1. Check if there are active {{< param "PRODUCT_NAME" >}} instances that shouldn't be running.
+   There may be an older {{< param "PRODUCT_NAME" >}} instance that wasn't shut down before a new one was started.
+1. Inspect the configuration to see if multiple {{< param "PRODUCT_NAME" >}} instances could be scraping the same target.
+1. Inspect the write-ahead log (WAL) to see which {{< param "PRODUCT_NAME" >}} instance sent those metric samples.
+   The WAL is located in the directory set by the [run command][run-cmd] `--storage.path` argument.
+   You can use [Promtool][promtool] to inspect it and find out which metric series this {{< param "PRODUCT_NAME" >}} instance has sent since the last WAL truncation event.
+   For example:
+   ```
+   ./promtool tsdb dump --match='{__name__="otelcol_connector_spanmetrics_duration_seconds_bucket", http_method="GET", job="ExampleJobName"}' /path/to/wal/
+   ```
+
+[clustering]: ../../configure/clustering
+[mimir-ooo-err]: https://grafana.com/docs/mimir/latest/manage/mimir-runbooks/#err-mimir-sample-out-of-order
+[run-cmd]: ../../cli/run
+[promtool]: https://prometheus.io/docs/prometheus/latest/command-line/promtool/#promtool-tsdb
+
 ## Technical details
 
 `prometheus.remote_write` uses [snappy][] for compression.
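
Note on the Troubleshooting section added in this commit: the claim that two instances scraping the same target lead to out of order errors can be illustrated with a toy model. The sketch below is not Mimir's or Prometheus's actual ingestion logic. The `Receiver` class, the `simulate` function, and the scrape offsets and send delays are all hypothetical, and the receiver is assumed to reject any sample older than the newest one it holds for a series.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Toy remote-write endpoint (hypothetical): rejects a sample whose
    timestamp is older than the newest sample already held for the series."""
    latest: dict = field(default_factory=dict)
    rejected: list = field(default_factory=list)

    def push(self, sender, series, timestamp):
        newest = self.latest.get(series)
        if newest is not None and timestamp < newest:
            # This is the "out of order" case described in the docs.
            self.rejected.append((sender, series, timestamp))
            return False
        self.latest[series] = timestamp
        return True


def simulate(collectors, series='up{job="ExampleJobName"}', scrapes=3, interval=60):
    """collectors maps a name to (scrape_offset_s, send_delay_s).

    Each collector scrapes the same series every `interval` seconds and
    delivers each sample `send_delay_s` seconds after scraping it.
    Returns the list of rejected (sender, series, timestamp) tuples.
    """
    events = []
    for name, (offset, delay) in collectors.items():
        for i in range(scrapes):
            ts = offset + i * interval
            events.append((ts + delay, name, ts))  # (arrival time, sender, sample ts)

    receiver = Receiver()
    for _, name, ts in sorted(events):  # process samples in arrival order
        receiver.push(name, series, ts)
    return receiver.rejected


# One collector: samples arrive late but still in timestamp order.
single = simulate({"A": (0, 45)})

# Two collectors on the same target: A's delayed samples arrive after
# B's newer ones, so the receiver rejects them.
duplicate = simulate({"A": (0, 45), "B": (30, 0)})

print(single)
print(duplicate)
```

In this model a single instance never triggers a rejection, however late its samples arrive, because its own samples stay in timestamp order. Rejections appear only once a second instance writes the same series, which matches the order of the troubleshooting steps: rule out duplicate scrapers first.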