Thanos failing to query data due to memcached #797

tumido · 2021-05-03T17:06:41Z

Grafana reporting error:

<html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. </body></html>

Thanos logs:

level=error ts=2021-05-03T15:10:51.072919873Z caller=handler.go:331 component=receive component=receive-handler err= msg="internal server error"

Thanos shard logs:

level=warn ts=2021-05-03T15:01:17.170470941Z caller=memcached_client.go:382 msg="failed to fetch items from memcached" numKeys=1 firstKey=attrs:01F4J99JM1SB8WQZPN1W6G42N6/chunks/000001 err="read tcp 10.130.3.22:58630->10.130.2.59:11211: i/o timeout"

Memcached logs:

Failed to write, and not due to blocking: Broken pipe
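
For triage, the Thanos lines above are logfmt-style `key=value` pairs, so the `caller` and `err` fields can be pulled out mechanically. A minimal sketch, assuming quoted values never contain escaped quotes; `parse_logfmt` is a hypothetical helper, not part of Thanos:

```python
import shlex


def parse_logfmt(line):
    """Split a logfmt line into a dict; quoted values keep their spaces."""
    fields = {}
    for token in shlex.split(line):
        # Split on the first '=' only, since values may contain '=' themselves.
        key, _, value = token.partition("=")
        fields[key] = value
    return fields


# The Thanos shard log line quoted above.
shard_line = (
    "level=warn ts=2021-05-03T15:01:17.170470941Z caller=memcached_client.go:382 "
    'msg="failed to fetch items from memcached" numKeys=1 '
    "firstKey=attrs:01F4J99JM1SB8WQZPN1W6G42N6/chunks/000001 "
    'err="read tcp 10.130.3.22:58630->10.130.2.59:11211: i/o timeout"'
)

fields = parse_logfmt(shard_line)
print(fields["caller"], "|", fields["err"])
```

An empty value such as the `err=` in the receive log parses to an empty string, which is itself a hint that the handler returned an error without a message.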


tumido · 2021-05-03T17:06:58Z

Extracted from: https://github.com/operate-first/SRE/issues/280

tumido · 2021-05-03T17:08:52Z

This is a separate issue from 280. Trying to upscale memcached pod.

tumido · 2021-05-03T17:20:06Z

* failed to execute action (CreateOrUpdate): discovering resource information failed for  in : groupVersion shouldn't be empty
* failed to execute action (CreateOrUpdate): Deployment.apps "opf-observatorium-thanos-query-frontend" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"query-cache", "app.kubernetes.io/instance":"opf-observatorium", "app.kubernetes.io/name":"thanos-query-frontend"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
* failed to execute action (CreateOrUpdate): StatefulSet.apps "opf-observatorium-thanos-rule" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
* failed to execute action (CreateOrUpdate): StatefulSet.apps "opf-observatorium-thanos-compact" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
* failed to execute action (CreateOrUpdate): Deployment.apps "opf-observatorium-loki-query-frontend" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"query-frontend", "app.kubernetes.io/instance":"observatorium-xyz", "app.kubernetes.io/name":"loki", "app.kubernetes.io/part-of":"observatorium"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
* failed to execute action (CreateOrUpdate): StatefulSet.apps "opf-observatorium-thanos-store-shard-0" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
* failed to execute action (CreateOrUpdate): Deployment.apps "opf-observatorium-thanos-receive-controller" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"kubernetes-controller", "app.kubernetes.io/instance":"opf-observatorium", "app.kubernetes.io/name":"thanos-receive-controller"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
* failed to execute action (CreateOrUpdate): Deployment.apps "opf-observatorium-observatorium-api" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"api", "app.kubernetes.io/instance":"observatorium-xyz", "app.kubernetes.io/name":"observatorium-api", "app.kubernetes.io/part-of":"observatorium"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
* failed to execute action (CreateOrUpdate): Deployment.apps "opf-observatorium-thanos-query" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"query-layer", "app.kubernetes.io/instance":"opf-observatorium", "app.kubernetes.io/name":"thanos-query"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
* failed to execute action (CreateOrUpdate): StatefulSet.apps "opf-observatorium-loki-querier" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
* failed to execute action (CreateOrUpdate): Deployment.apps "opf-observatorium-loki-distributor" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"distributor", "app.kubernetes.io/instance":"observatorium-xyz", "app.kubernetes.io/name":"loki", "app.kubernetes.io/part-of":"observatorium", "loki.grafana.com/gossip":"true"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
* failed to execute action (CreateOrUpdate): StatefulSet.apps "opf-observatorium-loki-ingester" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
* failed to execute action (CreateOrUpdate): StatefulSet.apps "opf-observatorium-thanos-receive-default" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
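
The failures above fall into two groups: Deployments rejected because `spec.selector` is immutable, and StatefulSets rejected because only `replicas`, `template`, and `updateStrategy` may be updated. A rough sketch for grouping such lines by kind and rejected field; `group_failures` and the regex are illustrative assumptions, and the sample lines are trimmed copies of the ones above:

```python
import re
from collections import defaultdict

# Captures the resource kind, its name, and the rejected field path.
PATTERN = re.compile(
    r'(Deployment|StatefulSet)\.apps "([^"]+)" is invalid: (spec(?:\.selector)?)'
)


def group_failures(lines):
    """Map (kind, field) to the list of resource names that failed."""
    grouped = defaultdict(list)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            kind, name, field = match.groups()
            grouped[(kind, field)].append(name)
    return dict(grouped)


# Trimmed versions of two of the error lines above.
sample = [
    '* failed to execute action (CreateOrUpdate): Deployment.apps '
    '"opf-observatorium-thanos-query" is invalid: spec.selector',
    '* failed to execute action (CreateOrUpdate): StatefulSet.apps '
    '"opf-observatorium-thanos-compact" is invalid: spec: Forbidden',
]

for (kind, field), names in group_failures(sample).items():
    print(kind, field, names)
```

Neither group can be fixed by updating the object in place, which is consistent with the upgrade needing manual cleanup.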

tumido · 2021-05-03T18:19:07Z

The currently deployed Observatorium version doesn't provide a way to raise the default memory limit on memcached, so we had to upgrade Observatorium to the latest image.

The upgrade wasn't smooth: we had to delete the PVCs and add anyuid to the containers.

tumido · 2021-05-03T19:10:38Z

Memcached still OOMKilled when a big query is run.

HumairAK · 2021-05-03T19:13:05Z

So we're seeing issues on 3 fronts:

* We are seeing 2 types of errors in odh-prometheus: 409 Conflict errors (not sure where these are coming from) and 500 Internal Server errors, likely due to failures occurring in Observatorium (see below).
* Memcached in opf-observatorium continues to cap out on memory and crash. It doesn't seem to survive stress tests (and I use the word stress loosely here) very well; at 1GiB of memory it crashes.
* Thanos Compact continues to fail its liveness/readiness probes. We suspect this may be due to a short timeout of 1s, though we are not certain.

tumido · 2021-05-03T19:14:23Z

Prometheus errors:

ts=2021-05-03T19:13:10.535Z caller=dedupe.go:112 component=remote level=warn remote_name=645dc0 url=http://opf-observatorium-thanos-receive.opf-observatorium.svc.cluster.local:19291/api/v1/receive msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: "
ts=2021-05-03T19:13:22.845Z caller=dedupe.go:112 component=remote level=error remote_name=645dc0 url=http://opf-observatorium-thanos-receive.opf-observatorium.svc.cluster.local:19291/api/v1/receive msg="non-recoverable error" count=494 err="server returned HTTP status 409 Conflict: conflict"
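
The two lines above are handled differently by the sender: the 500 is retried, while the 409 is logged as non-recoverable. A minimal sketch of that rule of thumb, assuming 5xx means a transient server-side failure and 4xx means the request itself was rejected; `is_recoverable` is a hypothetical name and this mirrors the log output, not necessarily Prometheus's exact implementation:

```python
def is_recoverable(status):
    """Assumed rule: only 5xx responses are worth retrying."""
    return 500 <= status <= 599


# The two statuses seen in the remote-write logs above.
for status in (500, 409):
    action = "retry the batch" if is_recoverable(status) else "give up on the batch"
    print(status, action)
```

So the 500s should go away once thanos-receive is healthy again, while the 409s need a separate explanation.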

HumairAK · 2021-05-03T19:28:55Z

Also it seems like the images being specified in the observatorium CR are not being respected.

tumido · 2021-05-04T13:14:40Z

Similar issues being handled elsewhere:

https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1386
Thanos Store consumes so much memory when I query metrics based on 5 days and beyond thanos-io/thanos#3450

Takeaway: lower the memcached memory limit so the store is not hammered by big queries, and avoid big queries.

We need to reach out to upstream to help us tune the setup. Maybe we're missing some rollup/downsampling settings somewhere.

4n4nd · 2021-05-07T17:50:54Z

> Also it seems like the images being specified in the observatorium CR are not being respected.

I have created an issue upstream for this: observatorium/operator#67

4n4nd · 2021-05-07T17:53:22Z

For the time being, I have scaled down the Observatorium Operator and manually updated the deployments to use the correct version of the image (quay.io/thanos/thanos:v0.20.1).

HumairAK · 2021-07-07T12:27:04Z

can this be closed @4n4nd ?

4n4nd · 2021-07-07T13:10:30Z

the default image issue is still there in the operator, so let's keep this issue open to track it.

HumairAK · 2021-10-07T15:43:25Z

@4n4nd is this issue still relevant? can it be closed? iirc you mentioned you'll be making some changes to how we deploy observatorium for smaug.

4n4nd · 2021-10-07T16:05:00Z

yeah we can close this

tumido mentioned this issue May 3, 2021

fix: Upgrade observatorium to solve memcached OOMKilled operate-first/apps#592

Merged

sesheta closed this as completed in operate-first/apps#592 May 3, 2021

tumido reopened this May 3, 2021

4n4nd self-assigned this May 6, 2021

4n4nd closed this as completed Oct 7, 2021

durandom transferred this issue from operate-first/operations Sep 6, 2022
